European Patent Application

The present invention relates to the subject matter claimed
and hence refers to a method and a device for compiling
programs for a reconfigurable device.

Reconfigurable devices are well-known. They include systolic
arrays, neuronal networks, multiprocessor systems, processors
comprising a plurality of ALU and/or logic cells, crossbar
switches, as well as FPGAs, DPGAs, XPUTERs, etc. Reference is
being made to DE 44 16 881 A1, DE 197 81 412 A1, DE 197 81 483
A1, DE 196 54 846 A1, DE 196 54 593 A1, DE 197 04 044.6 A1,
DE 198 80 129 A1, DE 198 61 088 A1, DE 199 80 312 A1,
PCT/DE 00/01869, DE 100 36 627 A1, DE 100 28 397 A1,
DE 101 10 530 A1, DE 101 11 014 A1, PCT/EP 00/10516,
EP 01 102 674 A1, DE 198 80 128 A1, DE 101 39 170 A1,
DE 198 09 640 A1, DE 199 26 538.0 A1, DE 100 50 442 A1, the
full disclosure of which is incorporated herein for purposes
of reference.

Furthermore, reference is being made to devices and methods as
known from US PS 6,311,200; US PS 6,298,472; US PS 6,288,566;
US PS 6,282,627; US PS 6,243,808 issued to Chameleon Systems
INC, USA noting that the disclosure of the present application
1 Introduction

This document describes the PACT Vectorizing C Compiler XPP-VC which maps a C subset extended
by port access functions to PACT's Native Mapping Language NML. A future extension of this compiler
for a host-XPP hybrid system is described in Section 7.3.

XPP-VC uses the public domain SUIF compiler system. For installation instructions on both SUIF and
XPP-VC, refer to the separately available installation notes.



2 General Approach 

The XPP-VC implementation is based on the public domain SUIF compiler framework (cf.
http://suif.stanford.edu). SUIF was chosen because it is easily extensible.

SUIF was extended with two passes: partition and nmlgen. The first pass, partition, tests if
the program complies with the restrictions of the compiler (cf. Section 3.1) and performs a dependence
analysis. It determines if a FOR-loop can be vectorized and annotates the syntax tree accordingly. In
XPP-VC, vectorization means that loop iterations are overlapped and executed in a pipelined, parallel
fashion. This technique is based on the Pipeline Vectorization method developed for reconfigurable
architectures.^1 partition also completely unrolls inner program FOR-loops which are annotated by
the user. All innermost loops (after unrolling) which can be vectorized are selected and annotated for
pipeline synthesis.

nmlgen generates a control/dataflow graph for the program as follows. First, program data is allocated
on the XPP Core. By default, nmlgen maps each program array to internal RAM blocks while scalar
variables are stored in registers within the PAEs. If instructed by a pragma directive (cf. Section 3.2.2),
arrays are mapped to external RAM. If it is large enough, an external RAM can hold several arrays.

Next, one ALU is allocated for each operator in the program (after loop unrolling, if applicable). The
ALUs are connected according to the data-flow of the program. This data-driven execution of the
operators automatically yields some instruction-level parallelism within a basic block of the program, but
the basic blocks are normally executed in their original, sequential order, controlled by event signals.
However, for generating more efficient XPP Core configurations, nmlgen generates pipelined operator
networks for inner program loops which have been annotated for vectorization by partition. In
other words, subsequent loop iterations are started before previous iterations have finished. Data packets
flow continuously through the operator pipelines. By applying pipeline balancing techniques, maximum
throughput is achieved. For many programs, additional performance gains are achieved by the complete
loop unrolling transformation. Though unrolled loops require more XPP resources because individual
PAEs are allocated for each loop iteration, they yield more parallelism and better exploitation of the XPP
Core.

Finally, nmlgen outputs a self-contained NML file containing a module which implements the program
on an XPP Core. The XPP IP parameters for the generated NML file are read from a configuration file,
cf. Section 4. Thus the parameters can be easily changed. Obviously, large programs may produce NML
files which cannot be placed and routed on a given XPP Core. Later XPP-VC releases will perform a
temporal partitioning of C programs in order to overcome this limitation, cf. Section 7.1.

^1 Cf. M. Weinhardt and W. Luk: Pipeline Vectorization, IEEE Transactions on Computer-Aided Design of Integrated Circuits
and Systems, Feb. 2001, pp. 234-248.
3 Language Coverage

This section describes which C files can currently be handled by XPP-VC.

3.1 Restrictions 
3.1.1 XPP Restrictions

The following C language operations cannot be mapped to an XPP Core at all. They are not allowed in
XPP-VC programs and need to be mapped to the host processor in a codesign compiler, cf. Section 7.3.

• Operating system calls, including I/O

• Division, modulo, non-constant shift, and floating point operations (unless XPP Core's ALU
supports them)^2

• The size of arrays mapped to internal RAMs is limited by the number and size of internal RAM
blocks.



3.1.2 XPP-VC Compiler Restrictions

The current XPP-VC implementation necessitates the following restrictions:

1. No multi-dimensional constant arrays (due to the SUIF version currently used)

2. No switch/case statements

3. No struct data types

4. No function calls except the XPP port and pragma functions defined in Section 3.2.1. The program
must only have one function (main).

5. No pointer operations

6. No library calls or recursive calls

7. No irregular control flow (break, continue, goto, label)

Additionally, there are currently some implementation-dependent restrictions for vectorized loops, cf.
the Release Notes. The compiler produces an explanatory message if an inner loop cannot be pipelined
despite the absence of dependences. However, for many of these cases, simple workarounds by minor
program changes are available. Furthermore, programs which are too large for one configuration cannot
be handled. They should be split into several configurations and sequenced onto the XPP Core, using
NML's reconfiguration commands. This will be performed automatically in later releases by temporal
partitioning, cf. Section 7.1.

^2 In future XPP-VC releases, an alternative, sequential implementation of these operations by NML macros will be available.
3.2 XPP-VC C Language Extensions

We now describe C language extensions used by XPP-VC. In order to use these extensions, the C program
must contain the following line:

#include "XPP.h"

This header file, XPP.h, defines the port functions defined below as well as the pragma function
XPP_unroll(). If XPP_unroll() directly precedes a FOR loop, it will be completely unrolled
by partition, cf. Section 6.2.

3.2.1 XPP Port Functions

Since the normal C I/O functions cannot be used on an XPP Core, a method to access the XPP I/O units
in port mode is provided. XPP.h contains the definition of the following two functions:

XPP_getstream(int ionum, int portnum, int *value)

XPP_putstream(int ionum, int portnum, int value)

ionum refers to an I/O unit (1..4), and portnum to the port used in this I/O unit (0 or 1). For the
duration of the execution of a program, an I/O unit may only be used either for port accesses or for
RAM accesses (see below). If an I/O unit is used in port mode, each portnum can only be used either
for read or for write accesses during the entire program execution. In the access functions, value is
the data received from or written to the stream. Note that XPP_getstream can currently only read
values into scalar variables (not directly into array elements!), whereas XPP_putstream can handle
any expressions. An example program using these functions is presented in Section 6.1.
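
The rule that each port may be used either for reading or for writing during the whole program run can be
checked on a host before a program is compiled for the XPP. The following is a minimal sketch of such a
check, not part of XPP-VC: the names port_dir, port_use and port_access_ok are hypothetical and do not
appear in XPP.h.

```c
#include <assert.h>

enum port_dir { PORT_UNUSED = 0, PORT_READ, PORT_WRITE };

/* Direction in which each port has been used so far:
 * 4 I/O units (ionum 1..4) with 2 ports each (portnum 0 or 1). */
static enum port_dir port_use[4][2];

/* Record an access to port `portnum` of I/O unit `ionum` and return 1
 * if it is consistent with all earlier accesses, 0 otherwise.
 * Hypothetical helper for illustration only. */
int port_access_ok(int ionum, int portnum, enum port_dir d)
{
    assert(d == PORT_READ || d == PORT_WRITE);
    if (ionum < 1 || ionum > 4 || portnum < 0 || portnum > 1)
        return 0;                           /* no such port */
    enum port_dir *slot = &port_use[ionum - 1][portnum];
    if (*slot == PORT_UNUSED)
        *slot = d;                          /* first use fixes the direction */
    return *slot == d;
}
```

A wrapper around each XPP_getstream or XPP_putstream call could invoke port_access_ok with PORT_READ or
PORT_WRITE and report the first conflicting access.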

3.2.2 pragma Directives

Arrays can be allocated to external memory by a compiler directive:

#pragma extern <var> <RAM_number>

Example: #pragma extern x 1 maps array x to external memory bank 1.
Note the following:

• <var> must be defined before it is used in the pragma.

• Bank <RAM_number> must be declared in the file xppvc_options, cf. Section 4.

• If two arrays are allocated to the same external RAM bank, they are arranged in the order of
appearance of their respective pragma directives. The resulting offsets are recorded in a generated
file, cf. Section 5.1.
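
The arrangement rule of the last item can be sketched in plain C. This is an illustration only and not
XPP-VC code: the struct ext_array and the function assign_offsets are hypothetical, and back-to-back
packing without padding is an assumption. The manual states only that the arrays appear in the order of
their pragma directives.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record for one array mapped by "#pragma extern". */
struct ext_array {
    const char *name;   /* array name (illustrative) */
    int bank;           /* external RAM bank, 1..4 */
    long words;         /* array size in words */
    long offset;        /* computed start offset within the bank */
};

/* Assign offsets in order of appearance. Assumes back-to-back packing,
 * which is an assumption and not documented behaviour. Returns 0 on
 * success, -1 if a bank number is invalid or a bank would overflow. */
int assign_offsets(struct ext_array *arr, size_t n, const long bank_size[4])
{
    long next_free[4] = {0, 0, 0, 0};
    assert(arr != NULL && bank_size != NULL);
    for (size_t i = 0; i < n; i++) {
        int b = arr[i].bank - 1;            /* banks are numbered from 1 */
        if (b < 0 || b > 3)
            return -1;
        if (arr[i].words < 0 || arr[i].words > bank_size[b] - next_free[b])
            return -1;                      /* bank missing or too small */
        arr[i].offset = next_free[b];
        next_free[b] += arr[i].words;
    }
    return 0;
}
```

A bank declared with size zero (cf. Section 4) rejects every array under this sketch, which mirrors the
requirement that a bank must be declared before it is used in a pragma.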
4 Directories and Files

After correct installation, the XPPC_ROOT environment variable is defined, and the PATH variable
extended. $XPPC_ROOT is the XPP-VC root directory. $XPPC_ROOT/bin contains all binary files
and the scripts xppvcmake and xppgcc. $XPPC_ROOT/doc contains this manual and the release
notes file. XPP.h is located in the include subdirectory.

Finally, $XPPC_ROOT/lib contains the options file xppvc_options. If an options file with the same name
exists in the current working directory or the .xds subdirectory of the user's home directory, it is used
(searched in this order) instead of the master file in $XPPC_ROOT/lib.



Option            Explanation                                           Default value in
                                                                        xppvc_options

debug             debug output enabled                                  on
version           XPP IP version                                        V2
pacsize           number of ALU-PAEs in x and y direction               6/12
xppsize           number of PACs in x and y direction                   1/1
busnumber         number of data and event buses per row (both dirs.)   0/6
iramsize          number of words in one internal RAM                   236
bitwidth          XPP data bit width                                    32
freg_data_port    number of FREG data ports                             3
breg_data_port    number of BREG data ports                             3
freg_event_port   number of FREG event ports                            4
breg_event_port   number of BREG event ports                            4

Table 1: Options



xppvc_options sets the compiler options listed in Table 1. Most of them define the XPP IP parameters
which are used in the generated NML file. Lines starting with a # character are comment lines.

Additionally, extram followed by four integers declares the external RAM banks used for storing
arrays. At most four external RAMs can be used. Each integer represents the size of the bank declared.
Size zero must be used for banks which do not exist. The master file contains the following line which
declares four 4GB (1G words) external banks:

extram 1073741824 1073741824 1073741824 1073741824

Note that, in order to simplify programming, xppvc_options does not have to be changed if an I/O
unit is used for port accesses. However, this memory bank is not available in this case despite being
declared.
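
The extram line format described above can be read with a few lines of C. The following is a sketch under
stated assumptions, not the parser used by XPP-VC (which this manual does not describe): parse_extram is
a hypothetical helper, and the handling of malformed lines is an assumption. Note that 1073741824 is
2^30 words, which at the default bit width of 32 corresponds to 4 GB per bank.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Parse one line of an options file. Hypothetical helper, not part of
 * XPP-VC. Returns 1 if the line is an "extram" declaration with four
 * sizes (stored in size[0..3], in words), 0 for comment lines and all
 * other options, and -1 for a malformed extram line. */
int parse_extram(const char *line, long size[4])
{
    char key[16];
    assert(line != NULL && size != NULL);
    if (line[0] == '#')                       /* comment line */
        return 0;
    if (sscanf(line, "%15s", key) != 1 || strcmp(key, "extram") != 0)
        return 0;                             /* empty line or other option */
    if (sscanf(line, "%*s %ld %ld %ld %ld",
               &size[0], &size[1], &size[2], &size[3]) != 4)
        return -1;                            /* fewer than four integers */
    for (int i = 0; i < 4; i++)
        if (size[i] < 0)
            return -1;                        /* sizes cannot be negative */
    return 1;
}
```

A size of zero is accepted by this sketch because the manual uses it to mark banks which do not exist.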



5 Using XPP-VC 
5.1 xppvcmake 

In order to create an NML file, file.c is compiled with the command xppvcmake file.nml.
xppvcmake file.xbin additionally calls xmap. With xppvcmake, XPP.h is automatically searched
for in directory $XPPC_ROOT/include.

2002-1-11 ccomp Public

The following output produced by translating the example program streamfir.c in Section 6.1 shows the
programs called by xppvcmake:

pscc -I[...] -parallel -no PORKY_FORWARD_PROP4
-.spr streamfir.c
porky -dead-code streamfir.spr streamfir.spr2
partition streamfir.spr2 streamfir.svo
Program analysis:
main: DO-LOOP, line 9 can be synthesized
main: can be synthesized completely
Program partitioning:
Entire program selected for XPP module synthesis.
main: DO-LOOP, line 9 selected for synthesis
porky -const-prop -scalarize -copy-prop -dead-code streamfir.svo streamfir.svo1
predep -normalize streamfir.svo1 streamfir.svo2
porky -ivar -know-bounds -fold streamfir.svo2 streamfir.sur
nmlgen streamfir.sur streamfir.xco

pscc is the SUIF frontend which translates streamfir.c into the SUIF intermediate representation. The
output of partition indicates that the entire program can be synthesized. The subsequent SUIF passes
perform some additional optimizations before nmlgen generates the NML file. A further generated file
shows the array to RAM mapping chosen by the compiler [...] information created by partition and
nmlgen [...]



5.2 xppgcc

[...] comparing simulation results obtained with xppvcmake, xmap and
[...] current directory for n in 1..4 and m in 0..1. For instance, the program in Section 6.1 is compiled as
follows:

xppgcc -o streamfir streamfir.c

The resulting program streamfir will read input data from port1_0.dat and write its results to port4_0.dat.^3



6 Examples
6.1 Stream Access

The following program streamfir.c is a small example showing the usage of the XPP_getstream and
XPP_putstream functions. The infinite WHILE-loop implements a small FIR filter which reads input
values from port 1.0 and writes output values to port 4.0. The variables xd, xdd and xddd are used to
store delayed input values. The compiler automatically generates a shift-register-like configuration for
these variables. Since no operator dependences exist in the loop, the loop iterations overlap automatically,
leading to a pipelined FIR filter execution.

 1  #include "XPP.h"
 2
 3  main() {
 4    int x, xd, xdd, xddd;
 5
 6    x = 0;
 7    xd = 0;
 8    xdd = 0;
 9    while (1) {
10      xddd = xdd;
11      xdd = xd;
12      xd = x;
13      XPP_getstream(1, 0, &x);
14      XPP_putstream(4, 0, (2*x + 6*xd + 6*xdd + 2*xddd) >> 4);
15    }
16  }
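
For checking results on a host, one iteration of the filter loop can be written as an ordinary C function.
The following is a sketch for illustration: fir_state, fir_reset and fir_step are hypothetical names, and
the port functions are replaced by a plain argument and return value. The coefficients 2, 6, 6, 2 sum to
16, so after the shift by 4 a constant input settles at the same constant output.

```c
#include <assert.h>
#include <stddef.h>

/* Delay line of the FIR filter: x, xd, xdd, xddd as in streamfir.c. */
struct fir_state {
    int x, xd, xdd, xddd;
};

/* Reset all values to zero. (The listing does not initialise xddd;
 * it is overwritten before its first use.) */
void fir_reset(struct fir_state *s)
{
    assert(s != NULL);
    s->x = s->xd = s->xdd = s->xddd = 0;
}

/* One loop iteration: shift the delay line, take a new input sample,
 * and return the filtered output sample. Inputs are assumed to be
 * non-negative, since >> on negative ints is implementation-defined. */
int fir_step(struct fir_state *s, int input)
{
    assert(s != NULL);
    s->xddd = s->xdd;
    s->xdd  = s->xd;
    s->xd   = s->x;
    s->x    = input;
    return (2 * s->x + 6 * s->xd + 6 * s->xdd + 2 * s->xddd) >> 4;
}
```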

After generating streamfir.xbin with the command xppvcmake streamfir.xbin, the following
command reads the input file port1_0.dat and writes the simulation results to xpp_port4_0.dat.

xsim -run 2000 -in1_0 port1_0.dat -out4_0 xpp_port4_0.dat \
streamfir.xbin > /dev/null

xpp_port4_0.dat can now be compared with port4_0.dat generated by compiling the program with
xppgcc and running it with the same port1_0.dat.^3

^3 However, programs receiving initial data from or writing result data to external RAMs in xsim cannot be compared to
directly compiled programs using xppgcc. The results may also differ if a bitwidth other than 32 is used for the generated
NML files.
6.2 Array Access

The following program arrayfir.c is an FIR filter operating on arrays. The first FOR-loop reads input data
from port 1.0 into array x, the second loop filters x and writes the filtered data into array y, and the third
loop outputs y on port 4.0.

 1  #include "XPP.h"
 2  #define N 256
 3  int x[N], y[N];
 4  const int c[4] = { 2, 4, 4, 2 };
 5  main() {
 6    int i, j, tmp;
 7    for (i = 0; i < N; i++) {
 8      XPP_getstream(1, 0, &tmp);
 9      x[i] = tmp;
10    }
11    for (i = 0; i < N-3; i++) {
12      tmp = 0;
13      XPP_unroll();
14      for (j = 0; j < 4; j++) {
15        tmp += c[j]*x[i+3-j];
16      }
17      y[i+2] = tmp;
18    }
19    for (i = 0; i < N-3; i++)
20      XPP_putstream(4, 0, y[i+2]);
21  }

xppvcmake produces the following output:

$ xppvcmake arrayfir.nml
pscc -I/home/wFama/xppc/include -parallel -no PORKY_FORWARD_PROP4
-.spr arrayfir.c
porky -dead-code arrayfir.spr arrayfir.spr2
partition arrayfir.spr2 arrayfir.svo
Program analysis:
main: FOR-LOOP i, line 7 can be synthesized/vectorized
main: FOR-LOOP j, line 14 can be synthesized/unrolled/vectorized
main: FOR-LOOP i, line 11 can be synthesized/vectorized
main: FOR-LOOP i, line 19 can be synthesized/vectorized
main: can be synthesized completely
Program partitioning:
Entire program selected for NML module synthesis.
main: FOR-LOOP i, line 7 selected for pipeline synthesis
main: FOR-LOOP i, line 11 selected for pipeline synthesis
main: FOR-LOOP i, line 19 selected for pipeline synthesis
...unrolling loop j
porky -const-prop -scalarize -copy-prop -dead-code arrayfir.svo arrayfir.svo1
predep -normalize arrayfir.svo1 arrayfir.svo2
porky -ivar -know-bounds -fold arrayfir.svo2 arrayfir.sur
nmlgen arrayfir.sur arrayfir.xco

The messages from partition show that all loops can be vectorized. The dependence analysis did not
find any loop-carried dependences preventing vectorization. The inner loop in the middle of the program
is unrolled. The outer loop's body is effectively substituted by the following statement:

y[i+2] = c[0]*x[i+3] + c[1]*x[i+2] + c[2]*x[i+1] + c[3]*x[i];

Since all remaining loops are innermost loops, they are selected for pipeline synthesis. Array reads,
computations, and array writes overlap. To reduce the number of array accesses, the compiler
automatically removes redundant array reads. In the middle loop, only x[i+3] is read. For x[i+2], x[i+1]
and x[i], delayed versions of x[i+3] are used, forming a shift-register. Therefore, each loop
iteration needs only one cycle since one read from x, all computations, and one write to y can be executed
concurrently.
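
The effect of removing the redundant array reads can be reproduced in ordinary C. The following sketch
compares the unrolled statement with a version that reads x only once per iteration and keeps the older
samples in scalars. It illustrates the idea only: the function names are hypothetical, and the prologue
that fills the scalars before the loop is an assumption, since the manual does not describe how the
generated shift-register is initialised.

```c
#include <assert.h>

#define N 256
static const int c[4] = { 2, 4, 4, 2 };

/* Direct form: four array reads per output, as in the unrolled statement. */
void fir_direct(const int x[N], int y[N])
{
    assert(x != 0 && y != 0);
    for (int i = 0; i < N - 3; i++)
        y[i + 2] = c[0]*x[i + 3] + c[1]*x[i + 2] + c[2]*x[i + 1] + c[3]*x[i];
}

/* Shift-register form: one array read per iteration. The scalars
 * d1..d3 hold delayed copies of earlier reads of x. */
void fir_shiftreg(const int x[N], int y[N])
{
    assert(x != 0 && y != 0);
    int d1 = x[2], d2 = x[1], d3 = x[0];   /* assumed prologue */
    for (int i = 0; i < N - 3; i++) {
        int d0 = x[i + 3];                  /* the only read from x */
        y[i + 2] = c[0]*d0 + c[1]*d1 + c[2]*d2 + c[3]*d3;
        d3 = d2; d2 = d1; d1 = d0;          /* shift by one position */
    }
}
```

Both functions write the same values to y[2] through y[N-2]; the second one needs a single read port on x,
which is what allows one loop iteration per cycle in the pipelined configuration.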

Finally, the following example program fragment is a 2-D edge detection algorithm.

/* 3x3 horiz. + vert. edge detection in both directions */
for(v=0; v<=VERLEN-3; v++) {
  for(h=0; h<=HORLEN-3; h++) {
    htmp = (p1[v+2][h] - p1[v][h]) +
           (p1[v+2][h+2] - p1[v][h+2]) +
           2 * (p1[v+2][h+1] - p1[v][h+1]);
    if (htmp < 0)
      htmp = - htmp;
    vtmp = (p1[v][h+2] - p1[v][h]) +
           (p1[v+2][h+2] - p1[v+2][h]) +
           2 * (p1[v+1][h+2] - p1[v+1][h]);
    if (vtmp < 0)
      vtmp = - vtmp;
    sum = htmp + vtmp;
    if (sum > 255)
      sum = 255;
    p2[v+1][h+1] = sum;
  }
}

As the output of partition shows, both loops can be vectorized. Since only innermost loops can be
pipelined, the outer loop is executed sequentially. (Note that the line numbers in the program outputs are
not obvious since only a program fragment is shown above.)
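
To try the fragment on a host, it can be wrapped into a function with explicit declarations. The following
is a sketch for illustration: the function name edge_detect, the 4x4 image size and the int pixel type are
assumptions, since the fragment does not show the declarations of p1, p2, VERLEN or HORLEN. The two
conditional negations are written with abs(), which computes the same values.

```c
#include <assert.h>
#include <stdlib.h>

#define VERLEN 4   /* small illustrative image size (assumption) */
#define HORLEN 4

/* 3x3 horizontal + vertical edge detection as in the fragment.
 * Border pixels of p2 are left untouched. */
void edge_detect(int p1[VERLEN][HORLEN], int p2[VERLEN][HORLEN])
{
    assert(p1 != NULL && p2 != NULL);
    for (int v = 0; v <= VERLEN - 3; v++) {
        for (int h = 0; h <= HORLEN - 3; h++) {
            int htmp = (p1[v+2][h]   - p1[v][h]) +
                       (p1[v+2][h+2] - p1[v][h+2]) +
                       2 * (p1[v+2][h+1] - p1[v][h+1]);
            int vtmp = (p1[v][h+2]   - p1[v][h]) +
                       (p1[v+2][h+2] - p1[v+2][h]) +
                       2 * (p1[v+1][h+2] - p1[v+1][h]);
            int sum = abs(htmp) + abs(vtmp);
            p2[v+1][h+1] = sum > 255 ? 255 : sum;   /* saturate */
        }
    }
}
```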

partition edge.spr2 edge.svo
Program analysis:
[...] can be synthesized/can be vectorized
main: can be synthesized completely
Program partitioning:
Entire program selected for XPP module synthesis.
[...] selected for pipeline synthesis
main: FOR-LOOP v, [...] selected for synthesis

[...] are automatically generated. [...] statements are implemented using SWAP
[...] not affected by which branch the
statements take.

7 Future Compiler Extensions 

XPP-VC. 

7.1 Temporal Partitioning

[...] programs are partitioned into several configurations [...]

By using the pragma function [...] XPP Core. Special NML configuration commands are [...]

[...] temporal partitions will be determined automatically.

7.2 Program Transformations

[...] some program transformations are useful. In addition [...],
i.e. enable more parallelism or better XPP usage.

Furthermore, programs containing more than one function could be handled by [...].

7.3 Codesign Compiler

[...] The user uses pragmas to annotate those program [...]



Time of receipt 18 Jan. 21:19

18/01/2002 21:18 +4989-58990118 VORBACH GFT MBH S. 19

Vectorizing C Compiler XPP-VC

[...] selected parts can be implemented on the XPP. Program parts containing non-mappable operations must
be executed by the host.

The program parts running on the host processor ("SW"), and the parts running on the PAE array [...] are
generated. In the [...]
file and then processed by the native C compiler of the host processor.

Thus the sequential control flow of the C program defines when XPP parts are configured into the XPP
Core and executed.
Abstract

The eXtreme Processing Platform (XPP) technology offers
a unique reconfigurable computing platform supported
with a set of tools. A C compiler, which integrates both
new and efficient compilation techniques and temporal
partitioning, is presented. Temporal partitioning guarantees
the compilation of programs with unlimited complexity as
long as the supported C-subset is used. A new partitioning
scheme, which permits the mapping of large loops of any kind
and is constrained neither by loop dependences nor by nested
structures, is also presented. Furthermore, temporal partitioning
is applied to reduce the configuration time overhead and
thus can lead to performance gains. The compilation from
C code to the configuration data, ready to be downloaded
onto the XPP, takes seconds for complete examples, which
is, as far as we know, not reproduced by any other
reconfigurable computing technology. The compiler represents
a step forward, by furnishing a truly "push-button"
approach, only comparable to microprocessor domains, and
thus can spread the use of the XPP technology and deal
with time-to-market pressures positively.



1 Introduction 

Many of today's applications are characterized by intensive
data-stream processing and high-performance requirements.
It is increasingly evident that such performance cannot
be achieved with today's microprocessor technology.
Conventional processors (including DSPs) are geared for
sequential processing. Multi-DSP and very long instruction
word (VLIW) processors still have severe memory
bottlenecks, lack the number of data ports required to support
multi-channel, high speed data streams, and fail to furnish
low power consumption solutions. Accelerating specific
functions using application-specific integrated circuits
(ASICs) relieves some of the processing burden and adds some
required features, but limits flexibility and requires expensive
non-recurring engineering (NRE) costs and long design
cycles. High density field-programmable gate arrays (FPGAs)
eliminate the NRE costs and add flexibility, but still
require long timing optimizations and verification cycles and
low level hardware efforts. Additionally, the fine-grained
structure adopted in FPGAs is not suitable to map at the
algorithmic level, which is proved by the well-known
difficulties to have a "push-button" high-level methodology to
program these architectures.

New reconfigurable processing units (RPUs) are being
introduced trying to solve those problems [1]. One of the
new promising architectures is the XPP [2][3]. The XPP
is a coarse-grained, runtime-reconfigurable, array parallel
structure. The architecture was designed to facilitate
programming and to support pipelining, dataflow computations,
and parallelism from the instruction to the task level
efficiently. Therefore, this technology is well suited for
applications in multimedia, telecommunications, simulation,
digital signal processing, and similar stream-based application
domains. The XPP architecture also supports dynamic
self-reconfiguration in a user transparent way. In order to
drastically reduce the time to program the XPP, and to keep
the user from architecture details, a high-level compiler
integrating temporal partitioning is required. Such a compiler
is the main topic of this paper.

This paper is organized as follows. The next section
briefly introduces the XPP technology. Section 3 outlines
compilation to the XPP and section 4 describes the temporal
partitioning steps. Section 5 shows some experimental
results, section 6 points out the main differences between
this and previous works, and finally section 7 concludes the
paper and enumerates ongoing and future work planned.



2 XPP Technology 

The XPP technology consists of a reconfigurable computing
platform delivered as a device or an intellectual property
(IP) core, and a complete development tool suite (XDS)
[2]. An XPP can be used as a coprocessor for CPU and DSP
architectures. A prior version of the technology has resulted
in the XPU128-ES [3], a prototype device, which was
produced in silicon.

The XPP architecture is based on a hierarchical array
of coarse-grain, adaptive computing elements called
Processing Array Elements (PAEs), and a packet-oriented
communication network. The strength of the XPP technology
originates from the combination of array processing with
unique and powerful run-time reconfiguration mechanisms.
Different tasks or applications can be configured and run
independently on different parts of the array. Reconfiguration
is triggered externally or even by special event signals
originating within the array, enabling self-reconfiguring
designs. By utilizing protocols implemented in hardware, data
and event packets are used to process, generate, decompose
and merge streams of data.

2.1 Array Structure

An XPP contains one or several Processing Array Clusters
(PACs), i.e., rectangular blocks of PAEs. Fig. 1 shows
the structure of a typical XPP device. It contains four PACs
(see top left-hand side). Each PAC is attached to a
Configuration Manager (CM) responsible for writing configuration
data into the configurable objects of the PAC using a
dedicated bus. Multi-PAC XPPs contain additional CMs for
configuration data handling, forming a hierarchical tree of
CMs. The root CM is called the supervising CM (SCM).
It has an external interface (dotted arrow originating from
the SCM in Fig. 1) which usually connects the SCM to an
external configuration memory. A CM consists of a state
machine and internal RAM for configuration caching (see
top right-hand side of Fig. 1).

Horizontal busses carry data and events. They can be
segmented by configurable switch-objects, and connected
to PAEs and special I/O objects at the periphery of the
device. The I/O objects can be used for data-streaming or
to access external resources (e.g., memories). The array also
provides a column of ports to the corresponding leaf CM.
A CM port can be used to send events to the CM
from the array. The typical PAE shown in Fig. 1 (bottom
right-hand side) contains three objects: one FREG (forward
register), one BREG (backward register) and one ALU. The
FREG object is used for vertical forward routing (with a
programmable number of register stages), or to perform
MERGE, SWAP or DEMUX operations (for controlled
stream manipulations). The BREG object is used for vertical
backward routing (registered or not), or to perform some
selected arithmetic operations (e.g., ADD, SUB, SHIFT).
The BREGs can also be used to perform logical operations
on events. Each ALU (see its internal structure on the
bottom left-hand side of Fig. 1) performs common two-input
fixed-point arithmetical and logical operations, and
comparisons. A MAC (multiply and accumulate) operation can
be performed using the ALU and the BREG objects of one
PAE in a single clock cycle.

Another standard PAE object is the memory object
which can be used in FIFO mode or as RAM for lookup
tables, intermediate results, etc. If such objects are needed,
they are located in the left and/or right columns of PAEs of
each PAC. However, any PAE object functionality can be
included in the XPP architecture.

A set of parameterizable features can be used to furnish
an XPP that best fits user and application demands.
These features include: the number of PACs and their PAEs,
number of internal memories, number of I/O ports, number
of buses, word bitwidth, cache size, depth of the FIFO to
configure each object, etc.
Figure 1: XPP architecture.



2.2 Packet Handling and Synchronization

PAE objects as defined above communicate via a packet-
oriented network. Two types of packets are sent through the
array: data and event packets. Data packets have a uniform
bitwidth specific to the XPP Core or device.

In normal operation mode, PAE objects are self-
synchronizing. An operation is performed as soon as all
necessary data input packets are available. The results are
forwarded as soon as they are computed and the previous
results have been consumed. Thus, a signal-flow graph can
be mapped directly to the ALU objects and data-streams can
flow through them in a pipelined manner without adding
specific hardware.
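
The firing rule described above can be modelled in a few lines of C. The following is a software sketch
for illustration only and does not describe PACT's hardware protocol: the types slot and add_node and
the function add_node_fire are hypothetical, and each connection is assumed to buffer exactly one packet.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One-packet buffer on a connection: either empty or holding a value. */
struct slot {
    bool full;
    int value;
};

/* A two-input adder object (hypothetical software model). */
struct add_node {
    struct slot *in_a, *in_b;   /* input connections */
    struct slot *out;           /* output connection */
};

/* Try to fire the node once. It fires only if both input packets are
 * available and the previous result has been consumed (output empty).
 * Returns true if the node fired. */
bool add_node_fire(struct add_node *n)
{
    assert(n != NULL && n->in_a != NULL && n->in_b != NULL && n->out != NULL);
    if (!n->in_a->full || !n->in_b->full || n->out->full)
        return false;                       /* stall; inputs stay in place */
    n->out->value = n->in_a->value + n->in_b->value;
    n->out->full = true;
    n->in_a->full = false;                  /* consume both inputs */
    n->in_b->full = false;
    return true;
}
```

Because a stalled node leaves its input packets untouched, no packet is lost when the consumer is slow,
and no global schedule is needed: each node decides locally when to fire.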

Event packets are one bit wide. They transmit state
information which controls ALU execution and packet
generation. For instance, they can be used to control the
merging of data-streams or to deliberately discard data packets.
Thus, conditional computations depending on the results of
earlier ALU operations are feasible. Events can even trigger
a self-reconfiguration of the device as explained below.

Each data or event packet is only forwarded if the
previous one has already been consumed. The communication
system was designed to transmit one packet on each
interconnect per cycle. Hardware protocols ensure that no
packets are lost, even in the case of pipeline stalls or during the
configuration process. This simplifies application
development considerably. No explicit scheduling of operations
is required.

2.3 Configuration

The XPP architecture is optimized for rapid and user-
transparent configuration. For this purpose, the configuration
managers in the CM tree operate independently (without
global synchronization), and therefore are able to
configure their respective parts of the array in parallel. Every
PAE stores locally its current configuration state, i.e.,
if it is part of a configuration or not (states "configured"
or "free"). Once a PAE is configured, it changes its state
to "configured". This prevents the respective CM from
reconfiguring a PAE which is still in use. The CM caches
the configuration data in its internal RAM and constantly
tries to configure the objects used by the next configuration
requested. Each XPP object has a configuration FIFO
which stores data of subsequent configurations. Once an
object has been released (state "free"), the next configuration
word in its FIFO is loaded immediately. Hence it is
possible to reconfigure partially in one clock cycle.
Additionally, a prefetching mechanism is used. While a
configuration is being loaded onto the FIFO of each object, other
configurations may already be requested and cached in the
low-level CMs' internal RAM. Thus, it does not need to be
requested all the way from the SCM down to the array when
objects become available. While loading a configuration,
its PAEs start their part of the computations as soon as
they are in state "configured".
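
The interplay of the "free"/"configured" state and the configuration FIFO can be modelled as a small state
machine. The following C sketch is an illustration under stated assumptions and does not describe the
actual hardware: the struct pae, the function names, and the FIFO depth of 4 are hypothetical (the paper
states only that the FIFO depth is a parameter of the XPP).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define FIFO_DEPTH 4   /* illustrative depth (assumption) */

enum pae_state { PAE_FREE, PAE_CONFIGURED };

/* Hypothetical model of one configurable object with its FIFO. */
struct pae {
    enum pae_state state;
    unsigned current;               /* active configuration word */
    unsigned fifo[FIFO_DEPTH];      /* pending configuration words */
    size_t head, count;
};

/* Load the next pending word if the object is free. */
static void pae_try_load(struct pae *p)
{
    if (p->state == PAE_FREE && p->count > 0) {
        p->current = p->fifo[p->head];
        p->head = (p->head + 1) % FIFO_DEPTH;
        p->count--;
        p->state = PAE_CONFIGURED;
    }
}

/* CM side: queue a configuration word. Returns false if the FIFO is full. */
bool pae_push_config(struct pae *p, unsigned word)
{
    assert(p != NULL);
    if (p->count == FIFO_DEPTH)
        return false;
    p->fifo[(p->head + p->count) % FIFO_DEPTH] = word;
    p->count++;
    pae_try_load(p);      /* a free object takes the word immediately */
    return true;
}

/* Object side: release the resource; a pending word is loaded at once. */
void pae_release(struct pae *p)
{
    assert(p != NULL);
    p->state = PAE_FREE;
    pae_try_load(p);
}
```

In this model a CM can never overwrite an object that is still in use: a word pushed while the object is
in state "configured" waits in the FIFO until pae_release is called.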

Each ALU object has an input event port that triggers
the self-releasing of its resources and of all of the objects
connected to it. Such an event is successively broadcast
according to the interconnections.

Because of its coarse-grain nature, an XPP device can
be configured rapidly. Since only the configuration of those
array objects actually used is necessary, the configuration
time depends on the application.

2.4 Development Tools

The XPP can be programmed by using the Native Mapping
Language (NML) [2], a PACT proprietary structural
language with reconfiguration primitives. It gives the
programmer direct access to all hardware features. In NML,
configurations consist of modules which are specified as
in a structural hardware description language, similar to,
for instance, structural VHDL. PAE objects are explicitly
allocated, optionally placed, and their connections
specified. Additionally, NML includes statements to support
configuration handling. Thus, configuration handling is
an explicit part of the NML application program. XDS
is an integrated environment for programming with NML.
The main component is the mapper xmap which compiles
NML source files, places and routes the objects, and
generates XPP binary files. xmap uses an enhanced force-based
placer with short runtimes. The XPP binaries can either be
simulated and visualized cycle by cycle with the xsim and
xvis tools, or directly executed on an XPP device. A
high-level compiler, described in the next section, has been added
to XDS and permits C programs to be mapped onto the XPP.

2.5 Application Execution on XPP

Reconfiguration and prefetching requests can be issued
by any CM in the tree (including the SCM, which can
respond to external requests) and also by event signals
generated in the array itself. Running modules can do a
self-releasing of their resources and request another
configuration. Thus, it is possible to execute an application
consisting of several configurations without any external
control.

The CM of the XPP makes it possible to exploit speculative
configuration^1, i.e., the configuration of a module possibly
used after the current one has finished execution. If the path
which includes that module is taken, the CM only has to
trigger the execution of the configuration (see the section of
the NML code in Fig. 2 and the simulation performed with
xsim in Fig. 3, where conf_MOD2 is speculatively
configured during the execution of conf_MOD0). If this path is
not taken, the CM triggers the releasing of the resources
already configured and requests the other configuration.



3 Compiling C Code with XPP-VC

The XPP Vectorizing C Compiler XPP-VC is based on
the SUIF compiler framework [4]. SUIF is used because
it is easily extensible. The XPP-VC compilation
flow is shown in Fig. 4. An options file, used by the
compiler, specifies the parameters of the targeted XPP and
the external memories connected to the XPP. To access XPP
I/O ports, specific C functions are provided.

^1 This has similarities to speculative execution. In this case, before
knowing if a configuration will be requested, its configuration is started.

CONFIG conf_MOD0 {
  CONF_MODULE(MOD0)         // request the configuration of MOD0
  REQUEST(conf_MOD2_spec)   // start speculative configuration
  // if (MOD0.CMPort0 == '0') then conf_MOD2_exec is requested
  CONF_CMPORT(MOD0.CMPort0, conf_MOD2_exec, ...)
  // if (MOD0.CMPort1 == '0') then conf_MOD1 requested
  CONF_CMPORT(MOD0.CMPort1, conf_MOD1, ...)
}
CONFIG conf_MOD2_spec {     // request the configuration of MOD2
  CONF_MODULE(MOD2)         // but do not start it
}
CONFIG conf_MOD2_exec {     // MOD2 is taken
  SET(MOD2.Start = 1)       // enable the start of computing of MOD2
  REQUEST(conf_MOD3)        // request the next configuration
}
CONFIG conf_MOD1 {          // MOD1 is taken
  REQUEST(conf_MOD2_free)   // releasing of resources of MOD2
  CONF_MODULE(MOD1)         // request the MOD1
  REQUEST(conf_MOD3)        // request the next configuration
}
CONFIG conf_MOD2_free {
  RECONF(MOD1.Start)        // releases the resources of MOD1
}

Figure 2: Section of NML describing the control flow.

The compiler starts with some architecture-independent
preprocessing passes based on well-known compilation
techniques [5]. During this step, FOR loops are automatically
unrolled if instructed by the programmer. Then the
compiler performs a data-dependence analysis. The
compiler tries to vectorize inner program FOR-loops. In
XPP-VC, vectorization means that loop iterations are overlapped
and executed in a pipelined, parallel fashion. This technique
is based on the Pipeline Vectorization method developed for
reconfigurable architectures [6].

The C program can be manually split into several
modules by using annotations. Otherwise, automatic temporal
partitioning can be applied (see section 4) in order to furnish
mappable modules and to reduce the overall latency.

MODGen generates one NML module for each temporal
partition. First, program data is allocated on the XPP. By
default, MODGen maps each program array to internal or
external RAM while scalar variables are stored in registers
within the PAEs. Next, a control/dataflow graph (CDFG) is
generated. Straight-line code without array accesses can be
directly mapped to a data-flow graph since the data
dependences are obvious in the DAG representation. One ALU
is allocated for each operator in the CDFG. Because of the
self-synchronization of operators on the XPP, no explicit
control or scheduling is needed. The same is true for
conditional execution of such blocks. Both branches are executed
in parallel and MUX operators select the correct output (and
discard the other one) depending on the condition. This
data-driven execution of the operators automatically yields
instruction-level parallelism. In contrast, accesses to the
same array have to be controlled explicitly to maintain the
correct execution order. MERGE operators (which select
one input without discarding the other one) route address
and write data packets in the correct order to the RAM, and
DEMUX operators route read data packets to the correct
subsequent operator. State machines for generating the
correct sequence of event signals (to control these operators)
are synthesized by the compiler. For conditional branches
containing array accesses or inner loops, DEMUX
operators controlled by the IF condition route data packets only
to the selected branch, and output values are taken from the
branch activated. Thus, only selected branches receive data
packets and execute.
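
The difference between the two schemes for conditionals can be sketched in a few lines of Python. This is only an illustrative model and not XPP-VC output; the names `mux_select`, `demux_route`, `then_branch`, `else_branch` and the `ram_log` list (a stand-in for RAM accesses) are hypothetical and introduced here for the example.

```python
ram_log = []  # hypothetical stand-in for RAM accesses made by a branch


def then_branch(x):
    ram_log.append(("then", x))  # side effect, e.g. an array write
    return x + 1


def else_branch(x):
    ram_log.append(("else", x))
    return x - 1


def mux_select(cond, x):
    # Both branch operators fire on the input packet (data-driven execution);
    # the MUX forwards one result and discards the other.
    t, e = then_branch(x), else_branch(x)
    return t if cond else e


def demux_route(cond, x):
    # The DEMUX forwards the packet to the selected branch only,
    # so the other branch never executes and never touches the RAM.
    return then_branch(x) if cond else else_branch(x)
```

In this model `mux_select` leaves two entries in `ram_log` per call, whereas `demux_route` leaves one, which is why branches containing array accesses need the DEMUX scheme.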

In loops, all variables updated in the loop body are
handled as follows. The first iteration uses an input packet for
the variable's value, and the subsequent iterations use
packets generated in the previous iteration. In all but the last
iteration, a DEMUX operator routes the outputs of the loop
body back to the body inputs. Only the results of the last
iteration are routed to the loop output by the DEMUX
operators. The control packets for the DEMUX are generated by
the loop counter or the comparator evaluating the exit
condition. Note that the internal operators' outputs cannot just
be connected to subsequent operators since they produce a
result in each loop iteration. The required last packet would
be hidden by a stream of intermediate packets. If array
accesses are present, a loop iteration may only be started after
the previous iteration has terminated because the original
access order must be maintained. This is enforced by event
signals.

For generating more efficient XPP configurations, MODGen
generates pipelined operator networks for inner program
loops which have been annotated for vectorization by
the preprocessing step. In other words, subsequent loop
iterations are started before previous iterations have
finished. Data packets flow continuously through the operator
pipelines. By applying pipeline balancing techniques,
maximum throughput is achieved. For many programs,
additional performance gains are achieved by the complete loop
unrolling transformation. Although unrolled loops usually
require more XPP resources, they yield more parallelism
and better exploitation of the XPP. To reduce the number of
array accesses, the compiler automatically removes
redundant array reads. When array references inside loops access
subsequent element positions, the compiler only uses one
reference and generates delayed structures, forming
shift-registers.

Figure 4: XPP-VC compilation flow.
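
The effect of the shift-register optimization can be illustrated with a small Python model. It is a sketch under assumptions, not the compiler's actual transformation: the three-element window and the function names are chosen here only for illustration, and "reads" simply counts array accesses.

```python
def window_sums_naive(x):
    """Three array reads per iteration: x[i], x[i+1], x[i+2]."""
    out, reads = [], 0
    for i in range(len(x) - 2):
        out.append(x[i] + x[i + 1] + x[i + 2])
        reads += 3
    return out, reads


def window_sums_shift_register(x):
    """One array read per iteration; two delay registers hold older values."""
    out, reads = [], 0
    d1 = d2 = 0  # delay registers forming the shift-register
    for i, newest in enumerate(x):
        reads += 1
        if i >= 2:  # registers now hold x[i-1] (d1) and x[i-2] (d2)
            out.append(d2 + d1 + newest)
        d2, d1 = d1, newest  # shift by one position
    return out, reads
```

For a five-element input both functions produce the same three sums, but the second one reads the array five times instead of nine.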

Finally, each module generated by MODGen is placed
and routed automatically by xmap.

The XPP-VC compiler currently supports a C subset
sufficient for programming real applications. Struct data
types, pointer operations, irregular control flow (break,
continue, goto, label), and recursive and operating
system calls are not supported or cannot be mapped to the
XPP.



4 Temporal Partitioning

A program too large to fit in an XPP can be handled by
splitting it into several parts (configurations) such that each
one is mappable. Temporal partitioning permits the
automatic exposing of configurations such that the overall
execution time of the application is minimized and the
application is successfully mapped onto the XPP resources. It
considers the costs to load into the cache, to configure, and to
execute each configuration with the XPP. An important strategy
that is considered is to pre-fetch configurations while another is
being configured or is running. Arrays of constants or with
pre-defined values used in one or more configurations can
be initialized in parallel with the execution of the previous
configurations. This takes advantage of the initialization of
the array carried out by using the configuration bus.

The set of partitions resulting from the splitting is then
processed by MODGen, generating a set of configurations.
Next, specific NML configuration commands are generated
which also exploit XPP's sophisticated configuration and
pre-fetching capabilities, and specify the configuration
control flow that is orchestrated by the CM.

4.1 Benefits of Temporal Partitioning 

Temporal partitioning targeting the XPP can reduce,
when efficiently applied, the overall execution time. Such
reduction can be mainly achieved by the following
issues: (1) reduction of each partition complexity can reduce
the interconnection delays (long interconnections may pass
through registers and thus add clock cycle delays); (2)
reduction of the number of references, in the section of the
program related to each partition, using the same resource,
by distributing the overall references among partitions, can
lead to performance gains as well. This happens with the
statements present in the program referring to the same
array; (3) reduction of the overall configuration overhead by
overlapping fetching, configuration and execution of
distinct partitions.
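
Benefit (3) can be made concrete with a simple latency model. The following Python sketch is an assumption-laden illustration, not the cost model used by the compiler: it treats fetch, configuration and execution as three pipeline stages, assumes the array has enough free resources to configure the next partition while the current one runs, and uses invented cycle counts.

```python
def sequential_latency(parts):
    """Each partition is fetched, configured and executed with no overlap."""
    return sum(fetch + config + execute for fetch, config, execute in parts)


def overlapped_latency(parts):
    """Fetch, configuration and execution of distinct partitions overlap."""
    fetch_done = config_done = exec_done = 0
    for fetch, config, execute in parts:
        # Fetches into the cache run back to back.
        fetch_done += fetch
        # Configuration waits for its own fetch and the previous configuration.
        config_done = max(fetch_done, config_done) + config
        # Execution waits for its own configuration and the previous execution.
        exec_done = max(config_done, exec_done) + execute
    return exec_done
```

With two partitions of (10, 20, 100) cycles each, the sequential model gives 260 cycles and the overlapped model 230, because the second partition is fetched and configured while the first one executes.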

Example: Consider the C example max_avg shown
in Fig. 5. Configuration boundaries are represented by
XPP_next_conf() statements. They define four
configurations in the code (see Fig. 6). Apart from exposing temporal
partitions in such a way that the mapping to XPP is
accomplished, combining only the most frequently taken
conditional paths in the same partition can reduce the total
execution time by substantially reducing the reconfiguration
time (since the partitions for the other paths are not
configured when they are not taken). Fig. 6 presents such a case.
If the path bb_0 and bb_1 has been identified as the most
frequently executed, this path can be in the same partition^2.
In such a case, the configuration related to bb_2 will only
be called when the most frequent path has not been taken.

^2 Tail duplication of bb_3 would permit to have a configuration with
(bb_0, bb_1, bb_3) and another one with (bb_2, bb_3).

// max_avg example
...
if(op==1) { // average kernel
  ...
  }
  average = sum/N;
  XPP_next_conf();
} else { // max kernel
  XPP_next_conf();
  max = 0;
  ...
    if(x[i] > max) max = x[i];
  }
  XPP_next_conf();
}

Figure 5: Example with two conditionally executed kernels
and with configuration boundaries represented.




Figure 6: CFG of the algorithm shown in Fig. 5. Lines
crossing edges represent the XPP_next_conf() statements in
the code. Bubbles containing basic blocks represent the
regions to be implemented in different partitions.

Since configuration takes many clock cycles, it is in most
cases preferable to reuse a configuration as long as
possible in order to reduce the reconfiguration time
overhead. Thus, loops in the source code are always good
candidates to be entirely implemented by a single configuration.

4.2 Partitioning Loops 

Each loop that does not fit onto the XPP can be dealt with
by performing loop distribution [5] (if applicable) or by
partitioning the loop and using the CM to orchestrate the control
flow. Currently, loop distribution is not automatically
applied. Instead, we propose a new method to partition
complex loops without restrictions. All the loops whose
bodies must be partitioned are transformed into straight-line
code with a jump to loop-exit or to the next iteration in
order that each partition can be compiled by MODGen. Fig. 7
shows an example of such a transformation without the
statements needed to communicate the value of scalar variables
between configurations. Each configuration requests the
next configuration to be taken (if none is requested then the
application terminates and the last configuration releases its
resources). Depending on the value of the i<N condition,
config. #2 takes two different paths which request #3 or #4
respectively. Since config. #3 always requests #2 at the end
of its execution, the initial behavior of the loop is preserved.
The temporal partitioning creates two additional
configuration boundaries to preserve the initial functionality. From
Fig. 7b it can be seen that configuration boundaries were
inserted before and after the if statement. These boundaries
are needed since the code before and after will be executed
once and both the if header and body will iterate N+1 and
N times respectively.



int i;                 int i;
                       i=0;                 #1
for(i=0;i<N;i++) {     lab1: if(i<N) {      #2
  stmt1;                 stmt1;             #2
  XPP_next_conf();       stmt2;             #3
  stmt2;                 i++;               #3
}                        goto lab1;         #3
                       }                    #2
stmt3;                 stmt3;               #4

a)                     b)                   c)

Figure 7: Example of the transformation applied for
partitioning loops. a) original code with the added statement
representing where the loop is partitioned; b) transformed
code; c) configuration ID for each statement in b).
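
The control flow that the CM orchestrates for such a partitioned loop can be mimicked in Python. This is a hedged sketch rather than generated NML: each configuration is modelled as a hypothetical function that returns the ID of the configuration it requests next (or `None` to terminate), and the loop increment is assumed to belong to configuration #3.

```python
def run_partitioned_loop(n):
    state = {"i": None}
    trace = []

    def conf1():  # #1: i = 0
        state["i"] = 0
        return 2

    def conf2():  # #2: if (i < N) run stmt1, otherwise leave the loop
        if state["i"] < n:
            trace.append("stmt1")
            return 3
        return 4

    def conf3():  # #3: stmt2; i++; goto lab1
        trace.append("stmt2")
        state["i"] += 1
        return 2

    def conf4():  # #4: stmt3, no further request, so the application ends
        trace.append("stmt3")
        return None

    configs = {1: conf1, 2: conf2, 3: conf3, 4: conf4}
    loaded, requested = [], 1
    while requested is not None:  # role of the CM: load what was requested
        loaded.append(requested)
        requested = configs[requested]()
    return loaded, trace
```

For N = 2 the model loads configuration #2 three times and #3 twice, matching the N+1 and N iterations of the if header and body.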

4.3 Automatic Partitioning

From the SUIF representation of the C source code the
temporal partitioning phase constructs an extended Hierarchical
Task Graph^3, HTG+. This extended graph has two
types of nodes: (1) behavioral nodes representing lines
of code in the input program; (2) array nodes
representing each array existent in the source code. For instance,
Fig. 8 shows the top level of the HTG+ for an
implementation of the DCT (Discrete Cosine Transform) based on
matrix multiplications. Type (1) nodes have three distinct
sub-types: (a) block nodes representing basic blocks; (b)
compound nodes representing if-then-else structures;
(c) loop nodes representing the loops (for, while). Loop
and compound nodes explicitly embody hierarchical levels.
Edges in the HTG+ represent data communication between
two nodes or just enforce execution precedence.

Each behavioral node of the HTG+ is labeled with the
following information (some of the labelling steps require
estimation efforts): (1) block and compound nodes:
number of ALUs and REGs; (2) loop nodes: number of
iterations (if unbound, profiling can be used), and number of
ALUs and REGs; (3) array nodes: the size of the array,
type of the elements, and, when they do exist, the
initialization values. Each edge between two behavioral nodes of the
HTG+ is labeled with the number of data words that must
be transferred between the two nodes. Each edge between
an array and a behavioral node in the HTG+ is labeled with
the number of load and store references in the source code
represented by the behavioral node to that particular array.
The estimated number of times that each load and store
reference will be executed is also collected. The use of the
same array by different behavioral nodes increases the
execution latency and the number of resources needed for this
partition^4.

^3 This model has been chosen because it also exposes loop and task [...]

TempPart uses three types of estimations: (1) number
of XPP resource units needed by the configuration
implementing a single or a set of behavior nodes; (2) latency for
a behavior node or a set of connected behavior nodes on the
HTG+ (this does not need to be accurate to the real
execution time and only needs to have relative accuracy); (3)
number of clock cycles to fetch and configure each
partition (calculated based on the number of configuration words
needed, which is computed with the estimation of the
resources needed directly from the SUIF representation or
with the number of edges, ALUs, REGs, and pre-defined
values existent in the NML graph generated by MODGen).




Figure 8: Top level of the HTG+ for the DCT example (this
top level consists of 4 loops). Circles and boxes represent
behavioral and array nodes respectively. Data is read from
an input port (Loop1) and written to an output port (Loop4).



^4 E.g., twice the number of references to the same RAM leads to more
than twice the number of objects required on XPP and delays each access
because of the objects needed to MERGE and DEMUX data and address
packets. Hence, combining several behavioral nodes in one partition incurs
an overhead which is computed during the temporal partitioning algorithm.



The temporal partitioning algorithm starts with a
partition for each node on the top of the HTG+ and then merges
iteratively adjacent partitions until no performance gains
are achieved considering the maximum available size for
each partition. Each partition must currently define, on the
control flow graph (CFG) of the program, regions of code
with all entries to the same instruction and possibly multiple
exits. The algorithm considers the overlapping of
configuration and execution with fetch during the merging of
partitions. The algorithm starts with the granularity of the nodes
in the HTG+ and only if a block node cannot be mapped
it considers partitioning at the statement or sub-block level.
Thus, the granularity of the algorithm adapts according to
the application needs.
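
The merging step can be sketched as a greedy loop. The Python below is a simplified illustration under stated assumptions and not the TempPart implementation: nodes are plain dictionaries with hypothetical `size` and `cycles` fields, the latency estimator is passed in by the caller, and the example estimator charges a fixed, invented reconfiguration overhead per partition instead of modelling fetch overlap.

```python
def merge_partitions(nodes, max_size, latency):
    """Start with one partition per node; merge adjacent ones while it pays."""
    parts = [[node] for node in nodes]

    def size(part):
        return sum(node["size"] for node in part)

    while True:
        best, best_cost = None, latency(parts)
        for k in range(len(parts) - 1):
            merged = parts[k] + parts[k + 1]
            if size(merged) > max_size:  # would not fit onto the device
                continue
            candidate = parts[:k] + [merged] + parts[k + 2:]
            cost = latency(candidate)
            if cost < best_cost:  # keep the merge with the largest gain
                best, best_cost = candidate, cost
        if best is None:  # no merge improves the estimated latency
            return parts
        parts = best


def fixed_overhead_latency(parts, overhead=50):
    """Toy estimator: a fixed overhead per partition plus its execution cycles."""
    return sum(overhead + sum(n["cycles"] for n in part) for part in parts)
```

With node sizes 10, 20, 40 and 30 and a maximum size of 64, the first two nodes are merged and the remaining ones stay separate, since every further merge would exceed the size limit.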

The temporal partitioning strategy only exploits configuration
boundaries inside loop bodies if an entire loop cannot
be mapped to the XPP or contains more than one inner loop
in the same level of the loop body. If these cases occur, the
algorithm is applied hierarchically to the body of the loop.

[Figure 9: temporal partitioning methodology; its input is the SUIF
representation annotated with data-dependencies.]

Fig. 9 shows the methodology, which uses three levels
(the computational efforts increase from the first to the
third level): (1) Temporal partitioning algorithm based on
the estimation of the needed resources done with
function costs based on the number and kind of operations in
the source code. The algorithm uses the HTG+ and the
SUIF representation of the program; (2) For each
configuration selected in the first level, the estimated sizes are
checked with the ones estimated by generating the NML
graph with MODGen. If the size surpasses the available
resources, the algorithm returns to level (1), relaxing the size
constraint (diminishing the maximum number of
available resources); (3) Check if each configuration
successfully checked in level (2) can be really mapped to the XPP.
This level uses functions of the mapper, placer and router.
If the configuration cannot be implemented in the XPP, the
algorithm returns to level (1), once more relaxing the size
constraint. The size constraint is relaxed by reducing the
alpha parameter in each backward iteration (see Fig. 9).
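
The three-level feedback loop can be outlined as follows. This Python sketch makes several assumptions that are not in the text: level (1) is replaced by a trivial greedy packing of node sizes, alpha is expressed as a percentage that drops by a fixed step, and the level (2) and level (3) checks are caller-supplied functions named `nml_size` and `mappable` for illustration.

```python
def pack(sizes, budget):
    """Level (1) stand-in: pack consecutive nodes greedily under the budget."""
    parts, current = [], []
    for s in sizes:
        if s > budget:
            return None  # a single node no longer fits
        if sum(current) + s > budget:
            parts.append(current)
            current = []
        current.append(s)
    if current:
        parts.append(current)
    return parts


def partition_with_feedback(sizes, max_resources, nml_size, mappable, step=10):
    for alpha in range(100, 0, -step):  # alpha as a percentage
        budget = max_resources * alpha // 100
        parts = pack(sizes, budget)  # level (1): cost-function estimate
        if parts is None:
            break
        # Level (2): size estimated from the generated NML graph.
        if not all(nml_size(p) <= max_resources for p in parts):
            continue
        # Level (3): check with mapper, placer and router functions.
        if all(mappable(p) for p in parts):
            return parts, alpha
    return None, None
```

If, for instance, the NML-based estimate is 30% larger than the level (1) estimate, four nodes of size 30 on a device with 100 resource units are accepted only once alpha has dropped to 80, yielding two partitions of two nodes each.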

After exposing the configurations with TempPart, the
compiler introduces the statements needed to communicate
scalar variables between partitions (see Fig. 10). Arrays are
used as inter-partition storage for scalar variables too, since
only RAMs (to which the arrays are mapped) keep their data
during reconfiguration. TempPart also ensures that arrays
used by more than one configuration, or by the same
configuration loaded more than once onto the XPP, are bound to
the same memory location and that such a location is not used by
other arrays during the lifetime of the array variable. The
assignment of all arrays (the ones initially used in the source
code plus the ones added to communicate data) to the
internal memories is done based on the lifetimes of the arrays
determined by the sequence of configurations that were
previously exposed in the source program. This permits, in some
cases, the use of fewer internal memories since they can be
time-shared among different configurations.
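
Lifetime-based sharing of internal memories can be illustrated with a left-edge style assignment. The sketch below is one plausible way to do it, not necessarily the method used by TempPart; lifetimes are given as hypothetical (first configuration, last configuration) index pairs and the array names are invented.

```python
def assign_memories(lifetimes):
    """Bind each array to a memory; memories are reused once a lifetime ends.

    lifetimes maps an array name to (first_conf, last_conf), both inclusive.
    """
    busy_until = []  # per memory: last configuration that still uses it
    binding = {}
    # Process arrays in order of the configuration where they first appear.
    for name, (first, last) in sorted(lifetimes.items(), key=lambda kv: kv[1][0]):
        for mem, end in enumerate(busy_until):
            if end < first:  # previous array in this memory is dead
                busy_until[mem] = last
                binding[name] = mem
                break
        else:  # no free memory found, allocate a new one
            busy_until.append(last)
            binding[name] = len(busy_until) - 1
    return binding
```

In the test case used for this sketch, four arrays with overlapping lifetimes across three configurations end up in only two memories.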



int ...
a = b * c;             a = b * c;           #1
XPP_next_conf();       comm[0] = a;         #1
d = a/e;               a = comm[0];         #2

a)                     b)                   c)

Figure 10: Example illustrating the communication of the
value of a scalar variable between two configurations. a)
source code; b) code with statements inserted to buffer the
data; c) configuration ID for each of the statements in b).
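
The buffering scheme can be mimicked in Python to show why it is needed. This is a toy model under assumptions: the dictionary `ram` stands for an internal RAM that survives reconfiguration, local variables stand for PAE registers that do not, and the array name `comm` as well as the function names are chosen here for illustration.

```python
def conf1(ram, b, c):
    a = b * c            # scalar held in a register of configuration #1
    ram["comm"][0] = a   # buffer it in RAM before reconfiguration


def conf2(ram, e):
    a = ram["comm"][0]   # restore the scalar in configuration #2
    return a / e


def run(b, c, e):
    ram = {"comm": [0]}  # only RAM contents survive between configurations
    conf1(ram, b, c)
    return conf2(ram, e)
```

Calling `run(6, 7, 2)` passes the value 42 from the first configuration to the second through the buffer array.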

4.4 Generating the NML Application 

Each partition is input to MODGen, which generates the
NML structure to be mapped to the XPP. MODGen
generates, for each exit point existent in each partition, an event
connected to one of the CM ports available in the XPP
(the CM can check if an event is generated and can
proceed with different configurations based on the value of the
event). The compiler generates both the NML
representation of each partition and the NML section specifying the
control flow of configurations. Such control flow is
orchestrated by the CM of the XPP during runtime, as has been
already explained.

The compiler also generates NML code considering the
pre-fetch (load of a configuration to the cache of the XPP)
of configurations. The compiler can furnish two different
strategies: (1) request of the pre-fetch of all configurations
existent in the application at the start of the execution; (2)
request in each configuration of the pre-fetch of the next.
The request is done before the start of the configuration
step for the current configuration. Strategy (1) is used most
of the time. However, there are cases where using (2) is
better. In the presence of several nested if-then-else
structures with different configurations for each branch, a
pre-fetch sequence defined at compile time can introduce
too much overhead.



5 Experimental Results

Tab. 1 shows some results obtained when compiling a set
of benchmarks with the XPP-VC. Note that none of the
examples shown was specially coded to exploit more
efficiently the architectural features of the XPP (i.e.,
partitioning and distribution of arrays among the internal
memories) and thus the results can be further improved. An XPP
core with a single PAC was used. The second column
represents the size of the PAC (number of columns and rows of
PAEs) used for each example. Columns #cf, #PAE, #Lat,
and #Max represent the number of configurations,
number of PAEs used (shown as the maximum number of
PAEs of the largest configuration and the total number of
PAEs virtually needed), overall latency (taking into account
setup, fetching, configuration, data communication and
execution), and the maximum number of objects executing per
cycle respectively. The last column shows the CPU time
(using a Pentium III @933MHz with Linux) to compile
each example (from the source program to the generation
of the binary configuration file).

DCT1 is an 8x8 discrete cosine transform
implementation which is based on two matrix multiplications. The
algorithm uses 5 loops for the multiplications and 2 loops to
stream I/O data. It is purely sequential (no unrolling is
used). Temporal partitioning improves the overall latency
of DCT1 by 13% and uses 31 PAEs (without partitioning 51
PAEs are used). Thus it can use a smaller XPP core. DCT2
uses the DCT kernel of DCT1 and traverses an input
image of a pre-defined size (16x16 is used). It uses 7 external
memories to load/store the image and 2 internal RAMs for
intermediate results and to store the coefficients. The
version with 6 configurations was obtained by performing
temporal partitioning. Since the example has two outer loops the
scheme for partitioning loops was applied (the compiler uses
one configuration boundary between the two main loops of
the DCT kernel). With this scheme, a gain of 4% in
performance was achieved using 30% fewer PAEs. Chen is a
pointer-free version, with 150 lines of C, of a DCT
implementation used in JPEG. Temporal partitioning furnished
an improved version: 66% in performance using 12% fewer
PAEs. The computation and data communication is
performed in 6W clock cycles. smooth represents an image
filter (16x16 image). The two inner loops (3x3 window) were
annotated to be unrolled, which led to an efficient
vectorization. An overall speedup of 4 (5.6 considering only
execution time) over the implementation obtained without
unrolling is obtained. Additionally, 2 fewer PAEs are used
with unrolling. Haar is an implementation of the forward
2D Haar wavelet transform. An input image of 16x16 is
used. A performance gain of 36% is achieved when
temporal partitioning is applied. FIR is a 1D FIR filter with 12
taps filtering 2048 samples. Even with all the overheads,
0.42 samples/cycle is computed (0.87, considering only the
latency to communicate data to external memories and the
FIR computation).

Each one of the examples was compiled in less than 5
seconds. This reveals that it is possible to have runtimes
comparable to the ones achieved by software compilation.
Performance gains obtained with temporal partitioning are
shown. Since we ran most examples with small data and
image sequences, the configuration overhead is significant.
Note that the current methodology uses neither
the full potential of the XPP nor some optimizations:
(1) The execution of a partition only starts after the full
configuration of its resources; (2) No pipelining between fetch
and configuration for the same partition has been used; (3)
The capacity of the XPP to configure concurrently distinct
PACs was not used; (4) An arbitrary order for pre-fetching
of configurations conditionally requested is used (the order
should be based on the most frequently taken path, e.g.,
determined by profiling); (5) The configuration FIFOs in each
array object were not used. Hence, the performance results
can be further improved.



6 Related Work 

The XPP technology offers a promising reconfigurable
computing platform. Being a step forward in the context of
reconfigurable computing, it makes it possible to attack some of the
well-known deficiencies of related technologies. The
following sub-sections illustrate the most closely related work
and reveal the most important differences.

6.1 High-Level Compilation

The work on compiling high-level descriptions onto
reconfigurable logic has been the focus of many researchers
since the first simple attempts [7]. Most of this work targets
FPGA devices and thus needs logic synthesis, even when
module generators are included in the compilation flow, as
is the case with the MARGE [8] compiler. In addition, such
approaches also need backend mapping, place and route,
which are very time consuming with FPGA technology.
Even when pre-placed and pre-routed components are used
to assist the compilation flow, the compilation time is still
in the order of minutes or hours.

New approaches have been used, which target research
architectures. One of those approaches is the Garp-C
compiler [9]. Although it is used for a reconfigware/software
architecture, the configuration bit stream generation, based
on exploitation of instruction-level parallelism beyond
basic blocks and assisted with fast mapping and placement
tasks, makes it possible to target fine-grain reconfigurable
architectures efficiently with short compilation times.

Like Garp-C and MARGE, XPP-VC also uses the SUIF
compiler front-end. The generation of the hardware
structure to be mapped to the XPP is assisted with the pipeline
vectorization ideas presented in [6]. However, the
generation of the control structure, based on the event packets
of the XPP, is completely new. Since the XPP is a
coarse-grained architecture, which directly supports arithmetic and
other operations occurring in high-level languages, there is
no need for complex synthesis and mapping. The control
structure is also directly mapped to objects handling events.

6.2 High-Level Temporal Partitioning

Temporal partitioning at the behavioral level has
already been successfully conducted for FPGAs and other types of
RPUs. The majority of the current approaches try to use a
minimum number of configurations by using all the
possible RPU size available for each temporal partition (see, for
instance, [10]). Such schemes only consider another
partition after the current one has filled the available resources
and are insensitive to the optimization that must be applied
to reduce the overall execution time by overlapping the fetching,
configuration and execution steps. Besides not considering
such optimizations, ILP formulations presented by some
authors [11] are incapable of dealing with the complexity of
many realistic applications.

One of the first attempts to reduce the configuration
overhead in the context of temporal partitioning has been
presented in [12]. However, the approach uses the simple
model of splitting the available FPGA resources into two
parts and performing temporal partitioning using half of the
total available area as the size constraint. The scheme only
overlaps configuration with execution of adjoining
partitions and does not take into account the pre-fetch steps that
can be efficiently used in some RPU architectures.
Furthermore, the approach causes problems when some resources
of the RPU must be shared by two or more partitions. This
contradicts the requirement of disjoint spaces of the RPU
used by two adjacent temporal partitions.

The temporal partitioning algorithm used in the XPP-VC
compiler is based on some ideas presented in [13]. The
special characteristics of the algorithm to deal with
resource-sharing during the creation of the partitions are not used
and special heuristics have been added to deal with the fetch
and configuration time of each partition. The proposed
partitioning of loops is first introduced in this paper. The
scheme can deal with any type of loop. The previous
approaches consider loop distribution when a loop does not
fit onto the RPU [14]. However, loops which cannot be
entirely mapped onto a single configuration and which cannot
be distributed are not compiled. Our method can deal with
programs with unlimited complexity as long as the
supported C subset is used. It does not depend on the feasibility
of a specific compiler transformation.



7 Conclusions and Future Work

This paper describes the new Vectorizing C Compiler,
XPP-VC, which maps programs in a C subset extended by
port access functions to PACT's XPP architecture. Assisted
with a fast place and route tool, it furnishes a complete
"push-button" path from algorithmic descriptions onto XPP
configuration data with short compilation times.

An innovative temporal partitioning scheme is
presented. It enables the mapping of complex programs and
furnishes XPP applications with performance gains by
hiding some of the configuration time. A new mechanism to
handle partitioning of loops, which supports loop execution
by the configuration manager of the XPP, is also presented.
Furthermore, the compiler generates self-contained
configuration data even when several configurations are exposed.




Ongoing work focuses on tuning the estimation steps to
assist automatic temporal partitioning and on improving the
configuration data generated.

In addition to loop unrolling, loop merging, loop
distribution and loop tiling will be used to improve loop
handling, i.e., enable more parallelism or better XPP usage. A
future extension of the compiler for a host-XPP hybrid
system is planned. The compiler will map suitable program
parts, especially inner loops, to the XPP, and the rest of the
program to the host processor.
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1 Benefits of Temporal Partitioning 

Temporal partitioning can have a distinct and important goal: to
enable the compilation of algorithms for which the mapping onto the RPU
(Reconfigurable Processing Unit) resources cannot be accomplished by only
one configuration. For instance, temporal partitioning targeting the XPP [1]
can reduce, when accurately applied, the overall execution latency. Such
reduction can be mainly enabled by the following issues:

• reduction of the interconnection lengths, by reducing each design com-
  plexity, can furnish better performance results (long interconnections pass
  through registers, thus adding clock cycle delays);

• reduction of each temporal partition's complexity can itself reduce the
  number of registers used for vertical routing;

• reduction of the number of references, in each temporal partition, using
  the same resource, by distributing the overall references among temporal
  partitions, can furnish better performance results as well. This happens
  with statements in the program referring to the same array;

• reduction of the overall configuration overhead by overlapping fetching,
  configuration and execution of distinct temporal partitions.

The reduction of the configuration overhead is due to 3 distinct sources of
overlapping, possible with the XPP architecture:

1. loading of subsequent configurations into the cache in parallel with the
   configuration of the current one;

2. execution of one configuration while the next one is being configured;

3. execution of one configuration while the next one is being loaded into the
   cache.
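
The effect of these three overlap sources can be illustrated with a small
timing model. The sketch below is not taken from XPP-VC. It assumes a linear
chain of temporal partitions, hypothetical per-partition cycle counts, and an
idealised pipeline in which fetch, configuration and execution each handle one
partition at a time (which in turn assumes that adjacent partitions occupy
disjoint areas of the array). The names tp_cost, latency_sequential and
latency_overlapped are made up for the illustration.

```c
#include <stddef.h>

/* Hypothetical costs of one temporal partition, in clock cycles. */
typedef struct {
    unsigned fetch;     /* load the configuration words into the cache */
    unsigned configure; /* configure the array from the cache */
    unsigned execute;   /* run the configuration on the array */
} tp_cost;

/* No overlap: every step of every partition runs back to back. */
unsigned long latency_sequential(const tp_cost *tp, size_t n)
{
    unsigned long total = 0;
    for (size_t i = 0; i < n; i++)
        total += (unsigned long)tp[i].fetch + tp[i].configure + tp[i].execute;
    return total;
}

/* Full overlap: a step of partition i starts as soon as the previous step
   of partition i and the same step of partition i-1 have both finished. */
unsigned long latency_overlapped(const tp_cost *tp, size_t n)
{
    unsigned long f = 0; /* time at which the last fetch finished */
    unsigned long c = 0; /* time at which the last configuration finished */
    unsigned long e = 0; /* time at which the last execution finished */
    for (size_t i = 0; i < n; i++) {
        f += tp[i].fetch;
        c = (f > c ? f : c) + tp[i].configure;
        e = (c > e ? c : e) + tp[i].execute;
    }
    return e;
}
```

For two partitions that each cost 10 cycles to fetch, 20 to configure and
100 to execute, the sequential model gives 260 cycles and the overlapped
model 230, since the second partition is fetched and configured while the
first one executes.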






(d) can have more impact if the map, place, and route phase could, as much
as possible, confine temporally adjacent configurations to distinct locations
of the XPP (augmenting the concurrency between execution and configuration).

For reconfigurable computing platforms, where the reconfiguration of the
array takes several clock cycles, it is in most cases preferable that a
configuration is reused as much as possible in order to reduce the
reconfiguration overhead. Thus, loops in the source code are always good
candidates to be entirely implemented in a single configuration.

2 Specification of Configurations

Configurations can be specified by the programmer using XPP_next_conf()
statements in the source code of a given application. Such statements must
expose, on the control flow graph (CFG) of the procedure, regions of code with
all entries to the same instruction and possibly multiple exits. The compiler
exposes the configurations, removes such statements from the SUIF1 [2]
intermediate representation, checks for invalid specifications of
configuration boundaries (when the statements expose regions with entries to
different statements in a region of code, or when code can be contained in
more than one region^1), inserts the code responsible for the data
communication between temporal partitions, and generates both the NML (Native
Mapping Language) [8] representation of each configuration and the application
section specifying the control flow of configurations. Such control flow is
orchestrated by the CM (Configuration Manager) of the XPP during runtime.

Consider a pointer-free version of the quantizer of an H.263 implementation
[3], whose code is shown in Fig. 1. Four XPP_next_conf() statements were
inserted in the code to specify three configurations. The configurations
specified are represented in the CFG of the example, which can be seen in
Fig. 2. Apart from specifying temporal partitions in such a way that the
mapping to the XPP is accomplished, there can be the case that merging only
the most frequently taken conditional paths in the same configuration reduces
the total execution time by substantially reducing the reconfiguration time
(since the partitions for the other paths are not configured when they are
not taken). Fig. 2 presents such a case. If the path bb_0, bb_1 and bb_5 is
identified as the most frequently executed, such a path can be specified to
be in the same configuration^2. In such a case, the configurations related to
bb_3 and bb_4 will only be called when the most frequent path has not been
taken. In some examples, paths are only executed in "debug mode" (as is the
case of the branch taken when QP evaluates to false in the source code of
Fig. 1).

After exposing the configurations, the temporal partitioning phase intro-
duces the statements needed to communicate scalar variables between two dif-
ferent configurations (see Fig. 3). Currently, the scalar variables are stored in

^1 Tail duplication could be applied in some examples.

^2 Tail duplication of bb_5 would permit to have a configuration with
{bb_0, bb_1, bb_2, bb_5}; another one with {bb_4, bb_5}; and another one
with {bb_3, bb_5}.



    if (QP) {
      if (Mode == MODE_INTRA || Mode == MODE_INTRA_Q) { /* Intra */
        qcoeff[0] = mmax(1,mmin(254,coeff[0]/8));
        for (i = 1; i < N; i++) {
          level = (abs(coeff[i])) / (2*QP);
          qcoeff[i] = mmin(127,mmax(-127,sign(coeff[i]) * level));
        }
        XPP_next_conf();
      } else { /* non intra */
        XPP_next_conf();
        for (i = 0; i < N; i++) {
          level = (abs(coeff[i])-QP/2) / (2*QP);
          qcoeff[i] = mmin(127,mmax(-127,sign(coeff[i]) * level));
        }
        XPP_next_conf();
      }
    } else {
      XPP_next_conf();
      /* No quantizing. */
      for (i = 0; i < N; i++) {
        qcoeff[i] = coeff[i];
      }
      XPP_next_conf();
    }

Figure 1: C source code of the quantization algorithm with configuration
boundaries specified.



Figure 2: CFG of the algorithm shown in Fig. 1. The lines crossing edges
represent the XPP_next_conf() statements in the code. The bubbles containing
basic blocks of the CFG represent the exposed regions of the CFG that are
transformed into different temporal partitions.

                        int comm[1];
    ...                 ...                   conf #1
    d = a/e;            a = comm[0];          conf #2
                        d = a/e;              conf #2

    a)                  b)                    c)

Figure 3: Example illustrating the communication of the value of a scalar
variable between two configurations. a) source code; b) source code with
statements inserted to buffer the data; c) configuration ID for each of the
statements in b).

arrays specially inserted in the SUIF1 representation of the given
application. Those arrays are mapped to internal memories of the XPP^3. The
temporal partitioning phase also ensures that arrays used by more than one
configuration, or by the same configuration loaded more than once to the XPP,
are bound to the same memory location and that such a location is not used by
other arrays during the lifetime of the array variable^4.

The assignment of all existing arrays (those initially used in the source
code plus those added to communicate data) to the internal memories is done
based on the lifetimes of the arrays, determined by the sequence of
configurations that were previously exposed in the application. This permits,
in some cases, to reduce the number of internal memories needed by
time-sharing some internal memories among different configurations during
the execution of the application on the XPP.
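
A lifetime-based assignment of this kind can be sketched as an interval
allocation. The code below is only an illustration and not the XPP-VC
implementation: it assumes that configurations run in a fixed linear sequence,
that each array has a single lifetime given by the first and last
configuration using it, and that the arrays are sorted by their first use.
The names array_life, assign_memories and MAX_MEMS are hypothetical.

```c
#include <stddef.h>

#define MAX_MEMS 16 /* hypothetical number of internal memories */

typedef struct {
    int first; /* index of the first configuration using the array */
    int last;  /* index of the last configuration using the array */
    int mem;   /* output: internal memory assigned, or -1 */
} array_life;

/* Greedy assignment: an array reuses the first memory whose previous
   occupant is no longer live, otherwise it takes a new memory.
   'a' must be sorted by 'first'. Returns the number of memories used,
   or -1 if more than MAX_MEMS would be needed. */
int assign_memories(array_life *a, size_t n)
{
    int busy_until[MAX_MEMS]; /* last configuration occupying each memory */
    int used = 0;

    for (size_t i = 0; i < n; i++)
        a[i].mem = -1;

    for (size_t i = 0; i < n; i++) {
        int m = -1;
        for (int k = 0; k < used; k++) {
            if (busy_until[k] < a[i].first) { /* lifetimes do not overlap */
                m = k;
                break;
            }
        }
        if (m < 0) {
            if (used == MAX_MEMS)
                return -1;
            m = used++;
        }
        busy_until[m] = a[i].last;
        a[i].mem = m;
    }
    return used;
}
```

With four arrays live over configurations [0,1], [0,2], [2,3] and [3,3],
the sketch needs only two memories, because the third and fourth arrays
reuse the memories released by the first and second.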

The XPP-VC compiler generates, for each exit point existing in each config-
uration, an event connected to one of the CM ports available in the XPP (the
CM can check if an event is generated and can proceed with different configu-
rations based on the value of the event). The generated event has value "0" if
the path that activates that exit is taken and "1" otherwise.

^3 In this case, the internal memories are used as data buffers for the
maintenance of the original program behavior.

^4 At the moment each configuration must use a number of array variables, to
be assigned to the internal memories of the XPP, less than or equal to the
number of internal memories of the XPP (the compiler assigns each array to a
distinct memory). However, the total number of arrays existent in the overall
configurations can surpass the number of internal memories, if some memories
can be shared between configurations due to the non-overlap of the lifetimes
of array variables. The data stored in memories is maintained across
reconfigurations.

3 Mapping Applications with Configuration Boundaries in Loop Bodies

Configuration boundaries in loop bodies can be dealt with by performing loop
distribution (as long as it can be applied) or by temporally partitioning the
loop and using the CM to orchestrate the control flow.

    int i;                   int i;
                             i=0;                   conf #1
    for(i=0;i<N;i++) {       lab1: if (i<N) {       conf #2
      statement1;              statement1;          conf #2
      XPP_next_conf();         statement2;          conf #3
      statement2;              i++;                 conf #3
    }                          goto lab1; }         conf #3
    statement3;              statement3;            conf #4

    a)                       b)                     c)

Figure 4: Example of the transformation applied to loops with configuration
boundaries in their bodies. a) original source code; b) transformed code; c)
configuration ID for each statement in b).

Currently, loop distribution [11] is not automatically applied^5. All the
loops with configuration boundaries specified in their bodies are transformed
into if() goto label kind loops in order to permit the NML generation by the
XPP-VC compiler. Fig. 4 shows an example of such a transformation, without
the statements needed to communicate the value of scalar variables between
configurations. The 3rd column shows the configuration ID of each statement.
Each configuration requests the next configuration to be taken (if the exit
taken is to the end, then only the "reconf"^6 of the configuration is done).
Conf #2 needs a conditional request mechanism to call conf #3 or conf #4
based on the value of the i<N expression. Since conf #3 always requests, at
the end of its execution, conf #2, the initial behavior of the loop is
maintained. The temporal partitioning task also creates two more
configuration boundaries to preserve the initial functionality. From Fig. 4b)
it can be seen that configuration boundaries were inserted before and after
the if statement. Such boundaries are needed since the code before and after
will be executed once, while the if header and body will iterate N+1 and N
times respectively.
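
The equivalence between the original loop and its if/goto form can be checked
on a host compiler. The sketch below is a hedged illustration only: the
concrete statements, the function names and the empty XPP_next_conf() stub are
assumptions, and the scalar t, which crosses the configuration boundary, would
additionally need the buffering described in Section 2.

```c
#include <stddef.h>

/* Stand-in for the boundary marker; a no-op when compiled for a host. */
static void XPP_next_conf(void) {}

/* a) original loop with a configuration boundary in its body */
long loop_original(const int *a, size_t n)
{
    long acc = 0, t = 0;
    size_t i;
    for (i = 0; i < n; i++) {
        t = 2L * a[i];      /* statement1 */
        XPP_next_conf();
        acc += t + 1;       /* statement2 */
    }
    return acc;             /* statement3 */
}

/* b) the same loop rewritten in the if/goto form */
long loop_transformed(const int *a, size_t n)
{
    long acc = 0, t = 0;
    size_t i;
    i = 0;                  /* conf #1 */
lab1:
    if (i < n) {            /* conf #2: evaluated n+1 times */
        t = 2L * a[i];      /* conf #2: statement1 */
        acc += t + 1;       /* conf #3: statement2 */
        i++;                /* conf #3 */
        goto lab1;          /* conf #3 requests conf #2 again */
    }
    return acc;             /* conf #4: statement3 */
}
```

Both functions return the same value for any input, which mirrors the claim
that the transformation preserves the initial behavior of the loop.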

The configuration boundaries inserted in loop bodies must specify, at the
scope of the loop body, the permitted type of regions (already explained).

Loop distribution (also known as "loop fission") can be the preferable form
to implement loops whose generated NML does not entirely fit in the available
resources of the XPP. Such a transformation can potentially lead to the
introduction of temporary arrays. Consider the loop shown in Fig. 5, where a
configuration boundary is specified. The loop can be split so that the two
statements are each in one loop and the configuration boundary is now outside
any loop body. However, we need to scalar expand variable s in order to
maintain the initial functionality (one array with the size of the number of
iterations of the loop must be declared and is used to communicate each s
value in each of the loop iterations).

    for(i=0; i<N; i++) {
      s = ...             // statement 1
      XPP_next_conf();
      ... = s             // statement 2
    }

    ...

    for(i=0; i<N; i++) {
      tmp_s[i] = ...      // statement 1
    }
    XPP_next_conf();
    for(i=0; i<N; i++) {
      ... = tmp_s[i]      // statement 2
    }

Figure 5: Applying loop distribution as another way to enable temporal parti-
tioning on loop bodies.

^5 The compiler should check if loop distribution can be applied on each
temporal partition boundary existent in loop bodies.

^6 "reconf" means that the resources used by that configuration are released
and then can be reconfigured.

João M P Cardoso, PACT Informationstechnologie GmbH
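
A concrete instance of loop distribution with scalar expansion is sketched
below. The statements, the fixed trip count N and the function names are
chosen only for illustration; tmp_s plays the role of the expanded scalar
described above.

```c
#include <stddef.h>

enum { N = 8 }; /* illustrative trip count */

/* Original loop: the scalar s carries a value across the boundary. */
void fused_loop(const int *in, int *out)
{
    int s;
    for (size_t i = 0; i < N; i++) {
        s = in[i] * in[i];   /* statement 1 */
        /* the configuration boundary would sit here */
        out[i] = s + 1;      /* statement 2 */
    }
}

/* Distributed loops: s is expanded into one array element per iteration,
   so the boundary can be placed between the two loops. */
void distributed_loops(const int *in, int *out)
{
    int tmp_s[N];
    for (size_t i = 0; i < N; i++)
        tmp_s[i] = in[i] * in[i];   /* statement 1 */
    /* the configuration boundary is now outside any loop body */
    for (size_t i = 0; i < N; i++)
        out[i] = tmp_s[i] + 1;      /* statement 2 */
}
```

The price of the transformation is the temporary array, which needs one
element per loop iteration and therefore occupies an internal memory.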



4 Execution Strategies 

To reduce the overall latency, efficient exploitation of the pipelining of
the 3 steps present in each temporal partition (fetching, configuring, and
array execution) must be conducted.

Two "reconf" modes can be used (the user can select one of the modes in the
options of the XPP-VC compiler related to temporal partitioning):

• "reconf" executed by the CM. In this case each configuration communi-
  cates with the CM by sending an event, upon completion of execution,
  to request the next configuration. This next configuration starts by exe-
  cuting a "reconf" command to an XPP resource of the configuration (that
  command is broadcast throughout all the resources used by the config-
  uration, and so the resources will be released and can be reconfigured by
  the next configuration). When a configuration can be requested by more
  than one previous configuration, special configurations are inserted in the
  Temporal Partition Control Flow Graph (TPCFG^7) between each source
  and the sink. Such special configurations only command the "reconf" of
  a resource in the XPP of the previous configuration and request the next
  one. This type of "reconf" does not permit overlapping of execution and
  configuration between temporal partitions;

• "reconf" self-applied by each configuration. In this case each
  configuration, at the end of its execution, broadcasts a "reconf" event to
  all the XPP resources pertaining to it. This mode does not need the
  addition of special configurations by the compiler and permits the CM to
  try to configure the next configuration called during the execution of the
  callee temporal partition (when only one configuration path is present).

^7 The TPCFG is a directed, possibly cyclic, graph where each node represents
a configuration (temporal partition) and each edge between two nodes
specifies the execution flow of the application through its temporal
partitions. There is only one edge between two nodes of the graph and each
node represents a region of the CFG of the application.

The compiler also generates NML code considering the pre-fetch (load of a
configuration into the cache of the XPP) of configurations. When the pre-fetch
is enabled, two strategies can also be automatically used:

• request of the pre-fetch of all configurations existing in the application
  at the start of the execution (during this pre-fetching the flow of
  configuration and execution is done in the way it is specified in the
  application section of the NML file);

• request in each configuration of the pre-fetch of the next. The request is
  done before the start of the configuration step for the current configura-
  tion.

The CM of the XPP also permits speculative configuration of a temporal parti-
tion, which can lead to better performance results even when the map, place
and route does not try to locate temporal partitions in non-overlapping areas
of the XPP. The strategy tries to configure the speculatively used partition
after the configuration of the current one. If the path which includes that
configuration is taken, the CM only has to enable the start of the execution
of the configuration (see the section of NML code in Fig. 6 and the simulation
results in Fig. 7, where conf_MOD2 is speculatively configured during the ex-
ecution of conf_MOD0). When such a path is not taken, the CM releases the
resources already configured and requests the other configuration.
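
The decision the CM takes once the first configuration finishes can be
written as a small function. This is only a sketch of the behavior described
in the text and in the comments of the NML listing of Fig. 6; it is not CM
code. It assumes that an event value of 0 on a CM port marks the exit that
was taken, and the names cm_next, cm_decision, NO_EVENT and cm_after_mod0 are
hypothetical.

```c
typedef enum { NEXT_NONE, NEXT_MOD2_EXEC, NEXT_MOD1 } cm_next;

typedef struct {
    cm_next next;            /* configuration requested next */
    int release_speculative; /* 1 if the speculatively configured MOD2
                                must be released before continuing */
} cm_decision;

#define NO_EVENT (-1) /* no event seen on the port (assumed encoding) */

/* Decide what follows MOD0 from the events on its two CM ports. */
cm_decision cm_after_mod0(int cmport0, int cmport1)
{
    cm_decision d = { NEXT_NONE, 0 };
    if (cmport0 == 0) {
        /* speculation was right: only the start of MOD2 must be enabled */
        d.next = NEXT_MOD2_EXEC;
    } else if (cmport1 == 0) {
        /* speculation was wrong: release MOD2, then configure MOD1 */
        d.next = NEXT_MOD1;
        d.release_speculative = 1;
    }
    return d;
}
```

When the speculated path is taken, no configuration time is spent after MOD0
ends; otherwise the cost is the release of the unused resources plus the
regular configuration of the alternative.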

5 Automatic Temporal Partitioning 

Automatic temporal partitioning permits the automatic exposing of configura-
tions oriented by two distinct goals:

• minimum number of configurations: this goal can be achieved with al-
  gorithms that try to use all the available reconfigurable processing units
  during the assignment of segments of behavioral code to the same con-
  figuration;

• minimum overall latency: this goal can be achieved by considering the costs
  to load into the cache, to configure and to execute each configuration with
  the XPP array. An important strategy that must be considered is the use
  of pre-fetch of configurations while one of the others is running. Arrays
  of constants or with pre-defined values used in one or more configurations
  can be initialized in one of the previous configurations, if such a
  configuration exists. This takes advantage of the initialization of the
  array carried out by using the configuration bus.



    CONFIG conf_MOD0 {
      CONF_MODULE(MOD0)          // request the configuration of MOD0
      SET(MOD0.Reconf.E = 1)     // enable the self releasing of resources
      REQUEST(conf_MOD2_spec)    // speculative configuration
      // if (MOD0.CMPort0 == "0") then conf_MOD2_exec is requested
      // else continue
      // if (MOD0.CMPort1 == "0") then conf_MOD1 is requested
      CONF_CMPORT(MOD0.CMPort0, conf_MOD2_exec, ...)  // MOD2 or continue
      CONF_CMPORT(MOD0.CMPort1, conf_MOD1, ...)       // MOD1
    }

    CONFIG conf_MOD2_spec {
      CONF_MODULE(MOD2)          // request the configuration of MOD2
    }

    CONFIG conf_MOD2_exec {      // MOD2 is taken
      SET(MOD2.Start.A = 1)      // enable the start of computing of MOD2
                                 // enable the self release of resources
      REQUEST(conf_MOD3)         // request the next configuration
    }

    CONFIG conf_MOD1 {           // MOD1 is taken
      REQUEST(conf_MOD2_rec)     // request the releasing of resources
      CONF_MODULE(MOD1)          // request the MOD1
      SET(MOD1.Reconf.E = 1)     // enable the self releasing of resources
      REQUEST(conf_MOD3)         // request the next configuration
    }

    CONFIG conf_MOD2_rec {
      RECONF(MOD1.Start)         // release the resources of MOD1
    }

Figure 6: Example of a section of NML code describing the speculative config-
uration concept.

Figure 7: Example of the overlapping among the fetch, configuration, and exe-
cution steps of different temporal partitions.



From the SUIF1 representation of the C source code, the temporal partitioning
phase constructs an extended HTG (Hierarchical Task Graph)^8, the HTG+. Such
extended graph has 2 types of nodes:

1. behavioral nodes representing source lines of code in the input program;

2. array nodes representing each array existent in the source code.

Type (1) nodes have 3 distinct sub-types:

1. block nodes representing basic blocks, with one entry and a single exit;

2. compound nodes representing if-then-else structures;

3. loop nodes representing the loops (for, while, etc.). Loop and compound
   nodes explicitly embody hierarchical levels.

Edges in the HTG+ represent data communication between two nodes or just
enforce execution precedence.

Each behavioral node of the HTG+ is labeled with the following information
(some of the labelling steps require estimation efforts):

• block and compound nodes: number of ALUs and REGs;

• loop nodes: number of iterations (unknown if unbound), and number of
  ALUs and REGs;

• array nodes: the size of the array, type of the elements, and, when they
  do exist, the initialization values.

^8 The model has been chosen because it will also permit to exploit loop and
task level parallelism.

Figure 8: Top level of the HTG+ for the DCT example (this top level consists
of 4 loops). Circles and boxes represent behavioral and array nodes
respectively. Loop 1 reads the data-stream from an input port to an internal
memory and Loop 4 writes the data-stream generated by the DCT code (Loops 2
and 3) from an internal memory to an output port of the XPP.

Each edge between two behavioral nodes of the HTG+ is labeled with the
number of data words that must be transferred between the two nodes.

Each edge between an array and a behavioral node in the HTG+ is labeled
with the number of load and store references (... = A[i] and A[i] = ...,
respectively) in the source code represented by the behavioral node to that
particular array. The estimated number of times that each load and store
reference will be executed is also collected. Such information is used to
calculate the penalty when two or more behavioral nodes are merged into the
same temporal partition. Such penalty is related to the use of the same array
by different behavioral nodes and adds an overhead to the execution latency
of that temporal partition and to the number of resources needed for its
implementation.
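
The node and edge labels described above map naturally onto a few C
structures. The sketch below is one possible layout and not the compiler's
internal representation; all type and field names are hypothetical. The helper
only counts how many behavioral nodes of a candidate partition touch the same
array, which is the situation in which the penalty applies; the penalty
formula itself is not given in the text and is not modelled here.

```c
#include <stddef.h>

typedef enum { NODE_BLOCK, NODE_COMPOUND, NODE_LOOP } behav_kind;

/* Behavioral node with its estimated labels. */
typedef struct {
    behav_kind kind;
    unsigned alus, regs; /* estimated number of ALUs and REGs */
    long iterations;     /* loop nodes only; -1 if unknown */
} behav_node;

/* Edge between an array node and a behavioral node. */
typedef struct {
    size_t node;              /* index of the behavioral node */
    size_t array;             /* index of the array node */
    unsigned loads, stores;   /* load and store references in the source */
    unsigned long exec_count; /* estimated executions of those references */
} array_edge;

/* Number of behavioral nodes inside a candidate partition that reference
   'array' (assumes at most one edge per node/array pair). A result of two
   or more means the array is shared within the partition. */
size_t nodes_sharing_array(const array_edge *e, size_t n_edges,
                           const int *in_partition, size_t array)
{
    size_t count = 0;
    for (size_t k = 0; k < n_edges; k++)
        if (e[k].array == array && in_partition[e[k].node])
            count++;
    return count;
}
```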

Fig. 8 shows the top level of the HTG+ for an implementation of the DCT
(Discrete Cosine Transform) based on matrix multiplications.

The automatic temporal partitioning phase needs 3 types of estimations:

• number of XPP resource units needed by the configuration implementing
  a single or a set of behavior nodes;

• latency for a behavior node or a set of connected behavior nodes on the
  HTG+ (this does not need to be accurate to the real execution time and
  only needs relative accuracy);

• number of clock cycles to fetch and configure each temporal partition
  (calculated based on the number of configuration words needed^9).

The temporal partitioning strategy does not exploit configuration boundaries
inside loop bodies, unless the entire loop cannot be mapped to the XPP. The
generation of this type of temporal partitions never produces better results
(at least when the loop behavior is ensured by the CM). The justification is
supported by the reutilization of resources, already configured, achieved when
the entire loop is implemented by a single configuration. When a loop does not
fit in the XPP, the algorithm is applied hierarchically to the body of the
loop.

^9 Can be estimated by the number of edges, ALU nodes, REG nodes, and
pre-defined values existent in the hardware graph generated by the XPP-VC
compiler.

Figure 9: Automatic Temporal Partitioning methodology.

Fig. 9 shows the methodology. The strategy works around 3 levels (the
computational efforts increase from the first to the third level):

1. Temporal partitioning algorithm based on the estimation of the needed
   resources done with cost functions based on the number and kind of op-
   erations in the source code. The algorithm uses the HTG+ and the SUIF
   representation of the program;

2. For each configuration selected in the first level, the estimated sizes are
   checked against the ones estimated by generating the NML graph with the
   XPP-VC compiler. If the size surpasses the available resources, the al-
   gorithm reruns level one, tightening the size constraint (diminishing the
   maximum number of available resources);

3. Check if each configuration successfully checked in level 2 can really be
   mapped to the XPP. This level uses functions of the mapper, placer and
   router. If the configuration cannot be implemented in the XPP, the algo-
   rithm returns to level 1, once more tightening the size constraint.
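
The control structure of these three levels can be sketched as a retry loop.
The code below is a simplified stand-in, not the XPP-VC algorithm: level 1 is
replaced by a greedy packing of nodes in program order, level 2 compares a
second size estimate per node against the available resources, and level 3 is
an optional callback. The names node_size, map_check, pack, check and
temporal_partition, as well as the fixed reduction step, are assumptions made
for the illustration.

```c
#include <stddef.h>

typedef struct {
    unsigned est; /* level 1: size estimated from source-level operations */
    unsigned nml; /* level 2: size estimated from the generated NML graph */
} node_size;

/* Level 3 hook (hypothetical): nonzero if nodes first..last can be mapped,
   placed and routed as one configuration. NULL means "always succeeds". */
typedef int (*map_check)(size_t first, size_t last);

/* Level 1 stand-in: greedy packing under 'budget'. Writes the partition
   index of each node to tp[]; returns the number of partitions, or 0 if
   some node alone exceeds the budget. */
static size_t pack(const node_size *n, size_t count, unsigned budget,
                   size_t *tp)
{
    size_t parts = 0;
    unsigned used = 0;
    for (size_t i = 0; i < count; i++) {
        if (n[i].est > budget)
            return 0;
        if (i == 0 || used + n[i].est > budget) {
            parts++;
            used = 0;
        }
        used += n[i].est;
        tp[i] = parts - 1;
    }
    return parts;
}

/* Levels 2 and 3 for one candidate partitioning. */
static int check(const node_size *n, size_t count, unsigned available,
                 map_check maps, const size_t *tp)
{
    size_t first = 0;
    unsigned sum = 0;
    for (size_t i = 0; i < count; i++) {
        if (i > 0 && tp[i] != tp[i - 1]) {
            if (maps && !maps(first, i - 1)) /* level 3 */
                return 0;
            first = i;
            sum = 0;
        }
        sum += n[i].nml;
        if (sum > available)                 /* level 2 */
            return 0;
    }
    return !(count > 0 && maps && !maps(first, count - 1));
}

/* Rerun level 1 with a smaller budget until levels 2 and 3 accept.
   Returns the number of partitions, or 0 if none was found. */
size_t temporal_partition(const node_size *n, size_t count,
                          unsigned available, unsigned step,
                          map_check maps, size_t *tp)
{
    if (step == 0)
        step = 1;
    unsigned budget = available;
    while (budget > 0) {
        size_t parts = pack(n, count, budget, tp);
        if (parts == 0)
            return 0;
        if (check(n, count, available, maps, tp))
            return parts;
        budget = budget > step ? budget - step : 0;
    }
    return 0;
}
```

With four nodes whose level 1 estimate is 4 units but whose NML size is 6
units, 12 available units and a step of 4, the first attempt packs three
nodes together and fails level 2; the second attempt, with a budget of 8,
yields two partitions of two nodes each.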

The temporal partitioning algorithm used is based on the ideas presented in
[?]. The special characteristics of the algorithm to deal with resource
sharing during the creation of the temporal partitions have been removed and
special heuristics have been added to deal with the fetch and configuration
time of each temporal partition. The algorithm tries to overlap configuration
and execution with fetch during the selection of the HTG+ nodes for each
temporal partition.
[describe more]

We call the reader's attention to the fact that the current methodology
uses neither the full potential of the XPP nor some optimizations:

1. The execution of a given temporal partition only starts after all the used
   resources have been configured;

2. No pipelining between fetch and configuration has been used. The config-
   uration of the XPP resources for a specific temporal partition only starts
   after its configuration words are fetched (loaded into the XPP cache);

3. No overlapping of execution between two or more configurations;

4. The capacity of the XPP technology to configure concurrently distinct
   PACs (each PAC has its own CM);

5. An arbitrary order for fetching of conditionally requested temporal
   partitions is used (the fetching order should follow the most frequently
   taken path determined by profiling);

6. Behavioral nodes exposed in the HTG+ as concurrent nodes are not at the
   moment implemented, by the XPP-VC compiler, with parallel execution.

Thus, we strongly believe that there is still potential to improve the
performance results achieved when using XPP-VC.

7 Related Work 

The XPP technology offers a unique reconfigurable computing platform sup-
ported by tools that permit compiling algorithms in C. Being a step forward
in the context of reconfigurable computing, it permits to attack some of the
deficiencies present in many, if not all, other reconfigurable computing
technologies. However, some of the work being done to augment the potential
of such technology has sources in some works previously done.

Temporal partitioning has already been successfully conducted for FPGAs
and other types of RPUs. The majority of the current approaches try to use a
minimum number of configurations by using all the possible RPU size available
for each temporal partition (see, for instance, [4]). Such schemes only
consider another temporal partition after the current one has filled the
available resources and are insensitive to the optimization that must be
applied to reduce the overall execution time by overlapping the fetching,
configuration and execution steps. Albeit not considering such optimizations,
ILP formulations presented by some authors [5] are incapable of dealing with
the complexity of many realistic examples.

One of the first attempts to reduce the configuration overhead in the context
of temporal partitioning has been presented in [6]. However, the approach uses
the simple model of splitting the available FPGA resources into two parts and
performing temporal partitioning using half of the total available area as the
size constraint. The scheme only overlaps configuration with execution of
adjoining partitions and does not take into account the pre-fetch steps that
can be efficiently used in some RPU architectures. Furthermore, the approach
can originate some problems when some resources of the RPU must be shared by
two or more partitions (eliminating the requirement of disjoint spaces of the
RPU used by two adjacent temporal partitions).

[12] presents the scheduling of kernels (sub-tasks) targeting the MorphoSys
architecture. They use an efficient search pruning scheme added to a heuris-
tic that permits to consider first the solutions which potentially lead to
the best performance results. However, they mainly orient the search to data
reuse among the scheduled kernels, which is only suitable for the type of
reconfigurable computing architectures where no local memories to the RPU are
available. The scheduler tries to overlap computing and data transfers and
minimize context reloading, which, as we can see from the examples shown,
cannot always lead to the overall minimum latency. The scheme needs as input
the application flow graph (without concurrency and conditional paths) and the
kernel timing. The approach does not consider temporal partitioning and so
needs that each kernel configuration does not exceed the context memory size.

Abstract

Resource virtualization on FPGA devices, achievable
thanks to dynamic reconfiguration capacities, provides
an attractive solution to save silicon area. Architectural
synthesis for dynamically reconfigurable FPGA-based
digital systems needs to consider the case of reducing the
number of temporal partitions (reconfigurations), by
enabling sharing of some functional units in the same
temporal partition. This paper proposes a novel algorithm
for automated datapath design, from behavioral input
descriptions (represented by a dataflow graph), which
simultaneously performs temporal partitioning and
sharing of functional units. The proposed algorithm
attempts to minimize both the number of temporal
partitions and the execution latency of the generated
solution. Temporal partitioning, resource sharing,
scheduling, and a simple form of allocation and binding
are all integrated in a single task. The algorithm is based
on heuristics and on a new concept of construction by
gradually enlarging timing slots. Results show the
efficiency and effectiveness of the algorithm when
compared to existent approaches.

1 Introduction 

The availability of multi-programmable logic devices
(such is the case of FPGAs, field programmable gate ar-
rays) with lower reconfiguration times has made possible
the concept of "virtual hardware" [1][2]: the hardware re-
sources are supposed unlimited and implementations that
oversize the resources available on the device are resolved
by temporal partitioning. Then, the temporally partitioned
solution is executed by time-sharing the device such that
the initial functionality is preserved. This concept prom-
ises to be an efficient solution to save silicon area [1]. One
of the applications is the switch among functionalities that
have mutual exclusiveness in the temporal domain, such
as the context-switching between coding/decoding
schemes in communication, video or audio systems.



Although even the latest commercial FPGAs, such as
the Xilinx Virtex family [3], do not have mechanisms to
efficiently implement temporally partitioned functionalities
and the reconfiguration time of the overall FPGA is still
quite high, the importance of the "virtual hardware" con-
cept has already been demonstrated with computationally
complex applications [4]. Industrial efforts are under way
to further improve the capability of the devices to handle
multiple configurations, storing several on-chip con-
figurations and permitting the switch between contexts in
a few nanoseconds [5].

The virtualization of FPGA resources has been consid-
ered by several authors, while dealing with circuit netlists
that oversize the available resources on the device ([6][7],
just to name a few). From the point of view of the design,
those approaches work at a much lower level of abstraction,
without the possibility to exploit tradeoffs between the
number of reconfigurations and the resource sharing of
functional units (FUs), for instance. The design automation
for FPGA-based systems should include temporal parti-
tioning algorithms able to efficiently exploit the new con-
cept. Tradeoffs among parallelism, communication costs,
execution and reconfiguration times, and sharing of some
FUs in the same reconfiguration need to be considered
during the architectural synthesis phases.

Sharing of FUs among operations is a technique to re-
use a single configuration of an FU by more than one op-
eration of the same type. On the other hand, temporal par-
titioning is a technique tailored to reuse the available re-
sources by different circuits (configurations) with the time-
multiplex of the device. The nodes of a given intermediate
representation (e.g., a dataflow graph) representing opera-
tions have to be scheduled in time steps to be executed in
each temporal partition (TP). Temporal partitioning must
preserve the dependencies among nodes (that are already
temporal dependencies) such that a node D dependent on
node A cannot be mapped to a partition executed before
the partition where node A is mapped. In addition, consid-
ering sharing FUs during temporal partitioning can lead
to better overall results (lower number of TPs and
better performance).

Figure 1a) shows a design flow which integrates tempo-
ral partitioning prior to the high-level synthesis tasks [8].
The majority, if not all, of the existent approaches utilize
the presented flow [9][10]. Our efforts address architec-
tural synthesis integrating temporal partitioning, and this
paper presents a new temporal partitioning algorithm that
effectively takes into account sharing of FUs, while main-
taining a small computational complexity. Besides, it is
sufficiently flexible to target different FPGA devices.
Figure 1b) shows the design flow proposed in this paper,
where temporal partitioning is integrated in the high-level
synthesis tasks and is performed simultaneously.

Figure 1. Design flow based on high-level
synthesis for reconfigurable computing
systems: a) traditional flow; b) proposed
flow.



Example 1. Motivational example. 

Consider the dataflow graph exhibited in Figure 2 (Ex1). It
consists of 4 additions and 2 multiplications. Suppose that
each adder uses 1 cell and has a latency of 1 clock cycle,
each multiplier uses 2 cells and has a latency of 2 clock
cycles and the maximum resources available on the device
equal 3 cells. The dataflow graph has a critical path
latency of 4 cycles and needs 8 cells given those FUs (last
row of Table I). Figure 2 shows an optimal solution (not
considering the area of multiplexers, registers and control
unit needed to implement sharing of a specific FU) for the
example, with results shown in the second row of Table I.
In Figure 2 each gray region identifies operations that are
mapped to the same FU. This optimal solution is achieved
with only one adder and one multiplier and fits totally on a
single TP. When not considering sharing of adders, the
optimum result is shown in the third row of Table I. The
algorithm proposed in this paper achieves those optimal
results. The fourth row of the table shows the solution
obtained when considering a leveling temporal partitioning
algorithm that does not consider resource sharing of FUs.
From this example, it can be seen that resource sharing can
reduce the number of reconfigurations and can also reduce
the overall execution latency. There are also cases where
the critical path latency of the input dataflow graph (last
row) is maintained (second row).

¹ There is no distinction among the terms high-level synthesis,
architectural synthesis and behavioral synthesis.
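
The cell counts quoted in Example 1 can be checked with a few lines of arithmetic. The sketch below is only an illustration: the dictionary names are hypothetical, and the overhead of multiplexers, registers and control is ignored, as in the example.

```python
# Illustrative arithmetic for Example 1 (names are hypothetical).
ops = {"add": 4, "mul": 2}      # operations in the dataflow graph
cells = {"add": 1, "mul": 2}    # cells used by one FU instance of each type
r_max = 3                       # cells available on the device

# One FU instance per operation (no sharing).
area_no_sharing = sum(ops[t] * cells[t] for t in ops)
# One FU instance per operation type (all FUs shared).
area_full_sharing = sum(cells[t] for t in ops)

print(area_no_sharing, area_no_sharing <= r_max)      # 8 False
print(area_full_sharing, area_full_sharing <= r_max)  # 3 True
```

Without sharing the graph needs 8 cells and cannot fit in a single TP of 3 cells, whereas one shared adder and one shared multiplier use exactly 3 cells.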




Figure 2. Dataflow graph of the example
Ex1.



Table I. Results for Ex1.

Approach                                    | FUs (+, ×) | #TPs | Execution latency | Resources used
Optimum (sharing of adders and multipliers) |            | 1    | 4                 | 3
Optimum (sharing of multipliers)            |            | 3    | 5                 | 3
ASAP (no sharing)                           | (4, 2)     | 4    | 6                 | 3
Without temporal partitioning (no sharing)  | (4, 2)     |      | 4                 | 8



The remainder of this paper is organized as follows.
Section 2 formulates and explains the problem. The
algorithm is explained in depth in section 3, where the
pseudo-code and the overall performed steps are fully elucidated
through an example. In section 4 experimental results are
shown and discussed. In section 5, related work is
described. Finally, in section 6, conclusions are presented
and further work is envisaged.

2 Problem Definition 

We are given a dataflow graph (DFG) representing a
behavioral description, $G = (V, E)$, topologically ordered,
directed and acyclic, with $|V|$ nodes $\{v_1, v_2, \ldots, v_{|V|}\}$
and $|E|$ edges, where each node $v_i$ represents an operation and
each edge $(v_i, v_j) \in E$ represents a dependence between nodes
$v_i$ and $v_j$. A dependence can be a simple precedence
dependence or a transport-dependence due to the transport
of data between two nodes. The DFG can be obtained from
an algorithmic input description. Such a pre-processing step
is beyond the scope of this article, but the front-end of our
Java compiler for reconfigurable computing systems can
be employed [12].

Here we assume that there is a component library with a
set of FUs and there is one FU for each type of operation
in the DFG. $\phi$ represents the set of FUs, from the
component library, to be instantiated by the algorithm. $R_{max}$
represents the resource capacity available on the device,
$R(\pi_i)$ returns the number of resources utilized by the TP $\pi_i$
and $R(v_i)$ returns the number of resources utilized by the
FU instance associated with $v_i$. $N(\pi_i)$ returns the subset of
nodes of $V$ mapped to $\pi_i$.

Each partition is a non-empty subset of $V$, where for
each node there exists a map to one and only one FU instance in
$\phi$. $\pi(v_i)$ identifies the TP where node $v_i$ is mapped. The set
of the TPs is represented by:

$$\wp = \{\pi_1, \pi_2, \ldots, \pi_N\}$$

where $N$ represents the number of TPs. A graph $G$, temporally
partitioned in $N$ subsets (TPs), is correct if:

- $\forall \pi_i, \pi_j \in \wp,\ i \neq j:\ N(\pi_i) \cap N(\pi_j) = \emptyset$: each $v_i \in V$ is
mapped to only one TP (here we do not consider cloning of
operations in the DFG);

- $\bigcup_{i=1}^{N} N(\pi_i) = V$: all the nodes of $V$ are mapped;

- $\forall \pi_i \in \wp:\ R(\pi_i) \le R_{max}$: each TP fits in the resources
available on the device;

- $\forall (v_i, v_j) \in E:\ \pi(v_i) \le \pi(v_j)$: the order of the execution of
the TPs does not violate the dependencies among
operations of the DFG (necessary condition to obtain the
same functionality).
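
The correctness conditions above translate directly into a small checker. This is a minimal sketch under stated assumptions: the function name, the representation of TPs as a list of sets in execution order, and the `size` callback (which stands in for the resource count of a TP under whatever sharing model is used) are all hypothetical.

```python
def is_correct_partitioning(nodes, edges, tps, size, r_max):
    """Check the correctness conditions for a list of TPs.

    nodes: iterable of node ids; edges: iterable of (u, v) dependences;
    tps: list of sets of nodes, executed in list order;
    size: callable returning the cells used by a TP (assumed model).
    """
    nodes = set(nodes)
    if any(len(tp) == 0 for tp in tps):      # every partition is non-empty
        return False
    where = {}
    for index, tp in enumerate(tps):
        for v in tp:
            if v in where:                   # node mapped to more than one TP
                return False
            where[v] = index
    if set(where) != nodes:                  # some node is not mapped
        return False
    if any(size(tp) > r_max for tp in tps):  # a TP exceeds the device
        return False
    # A node may not execute in a TP earlier than a node it depends on.
    return all(where[u] <= where[v] for u, v in edges)


# Hypothetical three-node chain: add -> mul -> add, all FUs shared per type.
op = {0: "add", 1: "mul", 2: "add"}
cells = {"add": 1, "mul": 2}
shared_size = lambda tp: sum(cells[t] for t in {op[v] for v in tp})
edges = [(0, 1), (1, 2)]
print(is_correct_partitioning(op, edges, [{0, 1}, {2}], shared_size, 3))  # True
print(is_correct_partitioning(op, edges, [{2}, {0, 1}], shared_size, 3))  # False
```

The second call fails only because node 2 would run before node 1, on which it depends.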

A correct set of TPs guarantees the same overall behavior
as the original graph (when executed from 1 to $N$ and
considering a correct communication mechanism to transfer
data among TPs). However, we are also interested in
the minimization of the overall execution latency. The cost
that reflects the overall execution latency in a
time-multiplexed device can be estimated by equation (1) or
(2), when partial or full reconfiguration of the available
resources is considered, respectively. $CS(\wp)$ returns the
minimum execution latency (number of control steps or
clock cycles) of the partitioned solution, $CS(\pi_i)$ refers to
the minimum execution latency of the TP $\pi_i$ (it may
include the communication costs and represents the
execution latency of the critical path of the graph formed by the
subset of nodes in $\pi_i$ and the correspondent edges,
considering that nodes sharing FU instances can exist), and
$\delta_i$ and $\delta$ represent the number of clock cycles to
reconfigure the TP $\pi_i$ or all the available resources, respectively.

$$CS(\wp) = \sum_{i=1}^{N} \bigl( CS(\pi_i) + \delta_i \bigr) \qquad (1)$$

$$CS(\wp) = \sum_{i=1}^{N} CS(\pi_i) + N \times \delta \qquad (2)$$
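
Equations (1) and (2) are simple sums, and a short sketch makes the difference between the two reconfiguration models concrete. The function names and the sample numbers are hypothetical.

```python
def total_latency_partial(tp_latencies, reconf_cycles):
    """Equation (1): each TP pays its own reconfiguration time."""
    if len(tp_latencies) != len(reconf_cycles):
        raise ValueError("one reconfiguration time per TP is required")
    return sum(cs + d for cs, d in zip(tp_latencies, reconf_cycles))


def total_latency_full(tp_latencies, full_reconf_cycles):
    """Equation (2): every TP pays one full-device reconfiguration."""
    return sum(tp_latencies) + len(tp_latencies) * full_reconf_cycles


# Two TPs of 4 and 3 control steps (made-up values).
print(total_latency_partial([4, 3], [10, 5]))  # 22
print(total_latency_full([4, 3], 10))          # 27
```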

The objective of our algorithm is to furnish a set of
datapaths that will be executed in sequence with a
minimum number of control steps². Each datapath unit fits on
the physically available resources. For the sake of
minimizing the number of TPs needed, exploiting sharing of
FUs while doing temporal partitioning needs to be
considered by the algorithm. Specifically, our algorithm has to
output:

- The set of TPs ($\wp$): each TP identifying the nodes of
the DFG assigned to it;

- The set of instances for each FU used ($\phi$);

- Each node of the DFG has to identify a specific FU
instance of $\phi$ implementing the operation.

From those outputs, it is straightforward to generate a
behavioral HDL-RTL (hardware description language at
the register transfer level) description of each TP control
unit and a structural HDL-RTL description of each
datapath, considering the existence of an HDL description
for each FU. The configurations can be generated from
those netlists using a traditional FPGA design flow.

3 Algorithm Simultaneously Exploiting
Temporal Partitioning and Sharing of FUs

The algorithm uses an initial number of TPs that can be
specified by the user. Another possibility is to use the
number of levels of the DFG, or the number of TPs utilized
by any temporal partitioning algorithm without using
sharing of FUs (e.g., ASAP [11]), as the initial number of TPs.
The user has to specify the total number of available
resources on the device. In addition, for each FU there exists
a boolean variable whose value indicates if the FU can be
shared or not (sharing of some FUs may need more
resources than the utilization of several FU instances, due to
the overhead of using auxiliary circuits needed for the
implementation of the sharing mechanism).

For a clear description, we show the main steps of the
algorithm with a connection to Example 1. A brief
exposition of the steps performed, when considering sharing of
all FUs, is sketched in Figure 3.

² We assume that each control/time step for scheduling is equal to the
clock period of the system. Thus, there is no distinction among the use of
clock cycle, control step or time step.

Figure 3. Algorithm execution through an
example: a) ASAP and ALAP start times; b)
the nodes in the critical path identified by
the gray region; c), d), e), f) and g) show
iterations of the algorithm.



The algorithm starts with the following steps:

1. Compute the set of child nodes³ of each node of the
DFG;

2. Map an FU instance to each operation in the DFG (at
the moment we consider neither more than one FU for the
same operation nor FUs capable of implementing more
than one operation);

3. Estimate the area and execution latency of each node in
the DFG according to the FU characterization, existent
in the component library, for the target device. This
step is beyond the scope of this article and from now
on we will assume that there exists, for each FU, an
estimation of the number of resources and of the
execution latency;

4. Compute the ASAP (as soon as possible) and ALAP (as
late as possible) start times for each node in the DFG
(see Figure 3a)), both unconstrained. When doing the
ALAP scheme, the algorithm also calculates the ALAP
level of each node;

5. Determine the set of nodes in one of the critical paths
of the DFG (see Figure 3b));

6. Create a number of TPs equal to the input number
specified (see the three TPs initially created in Figure
3c));

7. Assign each node of the set of nodes in one of the
critical paths of the DFG (determined in point 5) to a TP by
ascending level. When the number of TPs is larger than
the number of nodes in the critical path, the last TPs
are left empty; otherwise the last nodes of the set are
left unassigned (see the nodes assigned to each TP in
Figure 3c));

8. Assign the size (number of resources used) of a node in
a TP to the current size of that TP (see Figure 3c)).

³ A node $v_j$ is a child of a node $v_i$ if there exists a path from $v_i$ to the
end of the DFG that includes $v_j$.
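
Step 4 relies on unconstrained ASAP and ALAP start times. The sketch below shows one conventional way to compute them, assuming the latency dictionary lists the nodes in topological order (the DFG is stated to be topologically ordered). The function name and the four-node graph are hypothetical and are not the graph of Figure 2.

```python
def asap_alap(latency, edges):
    """Return (asap, alap, critical_path_length) for a DAG.

    latency: dict node -> clock cycles, keys in topological order;
    edges: list of (u, v) meaning v depends on u.
    """
    preds = {v: [] for v in latency}
    succs = {v: [] for v in latency}
    for u, v in edges:
        preds[v].append(u)
        succs[u].append(v)
    order = list(latency)
    asap = {}
    for v in order:
        # A node starts as soon as all of its predecessors have finished.
        asap[v] = max((asap[u] + latency[u] for u in preds[v]), default=0)
    length = max(asap[v] + latency[v] for v in order)
    alap = {}
    for v in reversed(order):
        # A node must finish before the latest start of any successor.
        finish = min((alap[w] for w in succs[v]), default=length)
        alap[v] = finish - latency[v]
    return asap, alap, length


# Hypothetical graph: a -> m -> b and c -> b (adders 1 cycle, multiplier 2).
latency = {"a": 1, "m": 2, "c": 1, "b": 1}
edges = [("a", "m"), ("m", "b"), ("c", "b")]
asap, alap, length = asap_alap(latency, edges)
zero_slack = [v for v in latency if asap[v] == alap[v]]
print(length)      # 4
print(zero_slack)  # ['a', 'm', 'b']
```

Nodes with equal ASAP and ALAP start times have no slack. In this small graph they form the single critical path, while in general they may belong to several critical paths, of which step 5 picks one.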

After the above steps the main kernel of the algorithm
is executed (see the pseudo-code in Figure 4, Figure 5 and
Figure 6). Some of the most important functions used by
the algorithm are listed and briefly explained below:

- $v_i$.ALAPlevel(): returns the level of $v_i$ considering an
ALAP leveling scheme;

- $v_i$.ALAPStart(): returns the ALAP start time of $v_i$;

- $\pi_i$.addEl($v_j$): adds the node $v_j$ to $\pi_i$;

- $\pi_i$.rmEl($v_j$): removes $v_j$ from $\pi_i$;

- $\pi_i$.sched($v_j$): returns the number of control steps of the
critical path considering that $v_j$ is mapped to $\pi_i$;

- $\wp$.add($\pi_i$): adds a new TP to the current set of TPs
($\pi_i$ will be the last TP in the set);

- $\wp$.elAt(i): returns the i-th TP from the set of TPs ($\wp$);

- findNodes(i): returns a list of nodes ready to be
mapped to the i-th TP.

Our algorithm progressively constructs a global
solution. On each iteration, the algorithm traverses
the sequence of the existent TPs trying to assign ready
nodes to each TP. Each TP has an associated maximum
slot time (MAXcs). A node ready to be mapped to a TP is
only considered for mapping if the resultant
execution latency of that TP (considering the mapping) does not
exceed the correspondent MAXcs (line 15 of Figure 4 and
lines 2, 21 and 29 of Figure 5). MAXcs of a given TP $\pi_i$ is
equal to the critical path latency of that TP plus a
relax amount: $CS(\pi_i) + relax$. On each iteration over the TPs
the relax value is incremented by the greatest common
divisor (gcd) of all the execution latencies of the
operations in the DFG (line 24 of Figure 4). When a node is
mapped (see function mapNode in Figure 6), the critical
path length of the associated TP is updated (lines 4 and 5
of Figure 6).
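
The slot test and the relax increment described above can be sketched in a few lines. The helper names are hypothetical, and the sketch covers only the MAXcs comparison, not the special cases handled by the pseudo-code.

```python
from functools import reduce
from math import gcd


def relax_step(latencies):
    """gcd of all operation latencies, used as the relax increment."""
    return reduce(gcd, latencies)


def fits_slot(cs_new, cs_tp, relax):
    """True when the mapping keeps the TP within MAXcs = CS(tp) + relax."""
    return cs_new <= cs_tp + relax


print(relax_step([1, 2, 2, 1]))  # 1 (adders of 1 cycle, multipliers of 2)
print(fits_slot(5, 4, 0))        # False
print(fits_slot(5, 4, 1))        # True, after one relax increment of 1
```

Using the gcd as the increment means the relax value only ever takes values that some combination of operation latencies can actually fill.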

The algorithm considers that nodes in contiguous time
steps mapped to the same TP and with the same operation
should be bound to the same FU instance.

A list of nodes ready for mapping to the current TP is
used. The list has the nodes sorted by increasing ALAP
start times (the candidate operation having the least ALAP
value will have the highest priority) and, for nodes with
the same ALAP start time, it uses the ASAP start time as a
tiebreak (by ascending or descending order). The list is
determined by examining, for a given node, its predecessors (they
must already be mapped in TPs before the current TP) and
the child set (the child nodes of the node to be mapped
must be on TPs after the TP under consideration). The
incremental update of the list of candidate nodes to be
mapped to the current TP, when each node is mapped, is
an option of the algorithm (lines 6 and 7 in Figure 6).
When this option is disabled the algorithm only tries to
update the list when it is empty. The algorithm uses a
static approach in the sense that the ALAP/ASAP values
are calculated only once and are not updated afterwards.
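
The priority order of the ready list can be expressed as a sort key. This is a minimal sketch with hypothetical names and made-up start times. It shows only the ordering, not how readiness is determined.

```python
def sort_ready(ready, alap, asap, asap_descending=False):
    """Order candidates by ALAP start time, ASAP start time as tiebreak."""
    sign = -1 if asap_descending else 1
    return sorted(ready, key=lambda v: (alap[v], sign * asap[v]))


alap = {"x": 2, "y": 1, "z": 2}
asap = {"x": 1, "y": 0, "z": 0}
print(sort_ready(["x", "y", "z"], alap, asap))                        # ['y', 'z', 'x']
print(sort_ready(["x", "y", "z"], alap, asap, asap_descending=True))  # ['y', 'x', 'z']
```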

// begin main kernel
BitSet NodesSched = marked with the nodes already mapped to TPs;
int NumTP = 0, relax = 0;
int step = gcd(all nodes in DFG);
while (notAllNodesSched(NodesSched)) {
  TP πi = ℘.elAt(NumTP);
  Vector ListReady = findNodes(NumTP);
  while (!ListReady.isEmpty()) {
    Node vk = ListReady.rmFirst();
    Boolean fit = (size <= Rmax);
    int CSnew = πi.sched(vk);
    if ((CSnew > (CS(πi) + relax)) && ((πi is the last TP) &&
        (N(πi) == 0) || (vk.ALAPlevel() < π(vk)))) {
      tryToSched(πi, vk, fit, CSnew - CS(πi), NodesSched, update,
                 CSnew, ListReady); // Figure 5
    } else {
      tryToSched(πi, vk, fit, relax, NodesSched, update,
                 CSnew, ListReady); // Figure 5
    }
  }
  NumTP++;
  if (NumTP == number of TPs) {
    relax += step;
    NumTP = 0;
  }
}
// end main kernel

Figure 4. Main kernel of the proposed
algorithm.



tryToSched(TP πi, Node vi, Boolean fit, int relax,
           BitSet NodesSched, Boolean update,
           int CSnew, Vector ListReady) {
  if ((CSnew <= (relax + CS(πi))) || (CS(πi) == 0))
    if only one node vj in πi {
      if (vi.ALAPStart() < vj.ALAPStart()) {
        if ((size - R(vj)) <= Rmax) {
          πi.rmEl(vj);
          πi.addEl(vi);
          NodesSched.clear(vj);
          NodesSched.set(vi);
          continue LOOP A;
        }
      }
    }
  boolean canShare = try sharing with a node of the same type
                     with a path of shared FUs with the smallest
                     length (number of nodes);
  if (canShare) {
    if (fit && share produces increase) {
      rmShare(vi);
    } else {
      int CSnew = πi.sched(vi);
      if (CSnew <= (relax + CS(πi))) {
        mapNode(πi, vi, NodesSched, update, CSnew, ListReady); // Figure 6
      } else {
        rmShare(vi);
        canShare = false;
      }
    }
  }
  if (!canShare && fit && (CSnew <= (relax + CS(πi))) || (CS(πi) == 0)) {
    mapNode(πi, vi, NodesSched, update, CSnew, ListReady); // Figure 6
  }
  if (vi not mapped and no FU with the operation type of vi in this TP
      and does not fit and this TP is the last TP) {
    create a new TP πn;
    ℘.add(πn);
    mapNode(πn, vi, NodesSched, update, CS(πn), ListReady); // Figure 6
    break LOOP B;
  }
} // end tryToSched

Figure 5. Function tryToSched.

1. mapNode(TP πi, Node vi, BitSet NodesSched,
2.         Boolean update, int CSnew, Vector ListReady) {
3.   πi.addEl(vi);
4.   NodesSched.set(vi);
5.   CS(πi) = CSnew;
6.   if (update)
7.     upDateAndSortALAP(ListReady, vi);
8. } // end mapNode

Figure 6. Function mapNode.

A special directed edge between two nodes is used to
identify in the DFG that both nodes share the same FU
instance. A path of nodes connected by such edges identifies
the sequence of utilization of the FU instance (from the
source to the sink). Those edges are added by the
algorithm to the initial DFG and provide an efficient way
to represent sharing of FU instances, to determine the
execution latency and to generate the datapath and control
unit descriptions. When a node $v_n$ is bound to an FU
instance that is already shared by two or more operations,
the algorithm adds a special edge from the last node of the
path of shared nodes to $v_n$. However, when there are one or
more nodes in that path that are children of $v_n$ in the DFG,
$v_n$ is inserted immediately before the first child found
traversing the path from left to right (see Figure 7).
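
The insertion rule for a sharing path can be sketched with an ordered list standing in for the chain of special edges (consecutive list entries are connected by one special edge). The list representation and the function name are assumptions made for illustration.

```python
def bind_to_shared_fu(path, node, children):
    """Insert `node` into the utilization order `path` of one FU instance.

    path: nodes already sharing the FU, from source to sink;
    children: set of nodes that are children of `node` in the DFG.
    The node goes immediately before its first child found in the path,
    or at the end when the path holds none of its children.
    """
    for position, other in enumerate(path):
        if other in children:
            return path[:position] + [node] + path[position:]
    return path + [node]


# Node "k" is a child of "n", so "n" must use the FU before "k".
print(bind_to_shared_fu(["a", "b", "k"], "n", {"k"}))  # ['a', 'b', 'n', 'k']
print(bind_to_shared_fu(["a", "b"], "n", set()))       # ['a', 'b', 'n']
```

Placing the node before its first child keeps the utilization order of the FU compatible with the data dependencies of the DFG.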




[Figure legend: note that node k is a child of node n in the DFG. A
special edge indicates that the source node and the sink node share
the same functional unit, in the order of the arrow.]

Figure 7. Special mapping of a node in a TP.

The algorithm considers the interchange⁴ of a node
previously assigned to a TP with a node ready to be mapped
in that TP. This occurs if the TP has only one node, the
node to be mapped has a lower ALAP start time and the
change is feasible in terms of available resources and
MAXcs (lines 3 to 13 of Figure 5).

After the execution of the main kernel the algorithm
considers a merge operation. This operation tries to group
adjacent TPs in a single TP considering resource sharing
among FU instances of the same type in the two considered
TPs (see Figure 3e), f) and g)). This step of the algorithm
is done incrementally and it stops when no merging is
possible. From the first two TPs of Figure 3e) it can be seen
that, by sharing the FUs, the two TPs can be merged. Figure
3f) illustrates the result of that merge. From the latter result
we can see that the two remaining TPs are also able to be
merged (with node 5 sharing the FU instance already
shared by nodes 0, 1 and 4). The resultant merge is shown
in Figure 3g) and the final solution requires only a single
TP.

When a node does not fit (in the sense that the MAXcs
of the considered TP is violated) in the last TP and that
TP is empty, the algorithm prefers to relax the TP to
accommodate the node (see lines 15 to 17 in Figure 4).

When finding if a node to be mapped can share an FU
instance already existent in the considered TP, the
algorithm binds the node/operation to the FU instance with
the shortest path of nodes sharing it (line 14 in Figure 5).

The steps of the algorithm shown in lines 32 to 37 of
Figure 5 are related to the case in which a node does not
fit (with or without resource sharing or with relaxation of
MAXcs) in the last TP: a new TP is added to the existent
set of TPs and the node is mapped to it.

4 Experimental results 

All the algorithms considered in this section have been
implemented in the Java(TM) language. All the executions
of the algorithms were conducted on a portable PC
(Pentium-II @ 366 MHz, 196 MB RAM) with the JIT compiler
integrated in the JDK 1.2 running on Windows 98.

The algorithm proposed in this paper has been tested
with a number of representative examples (with variable
complexity) and, whenever possible, the results obtained
are compared with other approaches.

SEHWA is the auto regression filter presented in
[13], HAL is the loop body of the differential equation
example [14], EWF is a digital fifth-order elliptic wave filter
[15], FIR is a 12-tap finite impulse response filter, and Mat4x4
corresponds to a fully parallel multiplication of two matrices
with 4x4 integer elements each. Mat4x4 is used as an
example of a high degree of operation-level parallelism.

We consider for all the experiments an execution
latency of 1 clock cycle for each adder and 2 clock cycles
for each multiplier. For the number of resources needed
for each FU we consider 1 cell for each adder and 4 cells
for each multiplier. Table II shows the main characteristics
of the considered examples (number of operations of each
type, total number of resources and the critical path
length).
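
The total resource figures follow directly from the operation counts and the per-FU cell costs stated above. The sketch below recomputes two of them. The function and constant names are hypothetical, and the operation counts are the ones listed for EWF and SEHWA in Table II.

```python
MUL_CELLS, ADD_CELLS = 4, 1   # cells per FU instance, as stated in the text


def total_cells(multiplications, additions):
    """Cells needed when every operation has its own FU instance."""
    return multiplications * MUL_CELLS + additions * ADD_CELLS


print(total_cells(8, 26))    # EWF: 58
print(total_cells(16, 12))   # SEHWA: 76
```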

Table II. Characteristics of the examples.

Example | Number of operations (×, +) | Total number of operations | Total number of resources (cells) | Critical path length of the DFG (clock cycles)
Ex1     | (2, 4)                      | 6                          | 12                                | 4
EWF     | (8, 26)                     | 34                         | 58                                | 17
FIR     | (12, 11)                    | 23                         | 59                                | 6
HAL     | (4, 6)                      | 10                         | 22                                | 7
SEHWA   | (16, 12)                    | 28                         | 76                                | 11
Mat4x4  | (64, 48)                    | 112                        | 304                               | 4

⁴ This involves the move of a node from a TP to the list and the mapping
of the ready node to that TP.

4.1 Sharing versus not sharing

Table III shows results for the considered examples.
Our* and Our** identify results obtained by applying the
proposed algorithm. Our* considers resource sharing for
both adder and multiplier units, and Our** only considers
resource sharing for multiplier units. #cs identifies the
execution latency (number of clock cycles) and #p the
number of TPs. Each solution related to our algorithm was
obtained in less than 1 s of CPU time.



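Table III. Results obtained for the examples.

Example | Rmax | ASAP #p | ASAP #cs | SA #p | SA #cs | Our* #p | Our* #cs | Our** #p | Our** #cs
Ex1     | 6    | 2       | 5        | 2     | ?      | 1       | 4        | 2        | 4
SEHWA   | 6    | 18      | 36       | 18    | 35     | 1       | 34       | 6        | 34
        | 10   | 9       | 24       | 9     | 19     | 1       | 18       | 6        | ?
        | 15   | ?       | 18       | 6     | 15     | 1       | 15       | 5        | ?
HAL     | 6    | 5       | 11       | 5     | 9      | 2       | 10       | 3        | 10
        | 10   | 3       | 10       | 3     | 7      | 2       | 7        | 3        | 7
EWF     | 6    | 12      | 26       | 12    | 23     | 1       | ?        | 9        | 25
        | 10   | 6       | 21       | 6     | 22     | 5       | 18       | 7        | 18
        | 15   | 4       | 19       | 4     | 18     | 1       | 17       | 4        | 18
FIR     | 6    | 14      | 28       | 14    | 27     | 1       | 20       | 4        | 27
        | 10   | 7       | 16       | 7     | 15     | 1       | 15       | 4        | 15
        | 15   | 5       | 12       | 5     | 11     | 1       | 11       | 3        | 11
Mat4x4  | 6    | 72      | 136      | 72    | ?      | 1       | 130      | 21       | 130
        | 10   | 37      | 69       | 37    | ?      | 1       | 66       | 17       | 66
        | 15   | 25      | 47       | 25    | 46     | 1       | 46       | 10       | 46
        | 20   | 16      | 29       | 16    | 29     | 2       | 29       | 4        | 29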

The SA results were obtained with the simulated
annealing version of temporal partitioning without resource
sharing proposed in [16]. Here, the algorithm is tuned to
optimize the overall execution time (the algorithm can
also exploit the tradeoff between execution time and
communication costs). The ASAP results refer to the
leveling technique proposed in [11].

Only Mat4x4 needed to start with the number of TPs
obtained by the ASAP approach to achieve the best
solution. For all the other examples, the best solution was
obtained starting with an initial number of TPs equal to the
number of levels of the DFG. The results for Mat4x4 in
Table III were collected with the update of the list of
ready nodes after each mapped node disabled (the list is
updated only when it is empty). It is strongly recommended
to disable the update option for examples with a high degree
of parallelism and a small critical path length.

The values in bold in the 6th and 8th columns of Table
III show the minimum execution latency for the datapaths
obtained by the considered approaches (not considering
configuration times). The values in bold in the 10th column
show that, even without considering sharing of adders,
our algorithm returns solutions with execution latencies
equal to the execution latencies obtained sharing all the
resources (8th column), despite the fact that those solutions
need more TPs.

When considering resource sharing for all FUs, a
minimum number of TPs (only 4 cases of Table III needed
more than one TP to produce a minimum execution time)
seems to ensure solutions with lower execution latencies
than those obtained by doing temporal partitioning with
ASAP or SA for the majority of the examples (only one
case is not as good as SA). Note that when all the FUs can
be shared and the resource overhead to implement sharing
is not taken into account, an empirical observation tells us
that the solutions with lower execution latency are those
with only one TP. This is explained by the fact that a new
TP produces an equal or worse effect than sharing FU
instances on the overall execution latency, because all the
nodes in that TP can only start executing after the end of
the execution of the TP immediately before.

When sharing of adders is not considered the algorithm
is capable of finding 13 solutions without inferior execution
latency.

4.2 Exploiting the number of TPs

An exploitation of the overall execution latency versus
the number of TPs is shown in Figure 8. Those results
were produced by calling the algorithm several times, each
time starting with a different initial number of TPs from a
range of 1 to 15. The exploitation has been done in
approximately 5.4 s of CPU time. All the solutions use only a
single TP and the best result (execution latency equal to 66
clock cycles) has been achieved when the algorithm
started with 8 TPs. The results without considering sharing
of adders are shown in Figure 9. The algorithm exploited a
range of TPs from 1 to 26 and the minimum execution
latency achieved was 66 clock cycles (solution with 21 TPs).
Based on those results we can select a solution that
minimizes the global execution latency taking into account the
reconfiguration times (see equation (2)).
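
Selecting among the explored solutions with equation (2) amounts to minimizing the summed TP latency plus the number of TPs times the full reconfiguration time. The sketch below assumes each explored solution is summarized as a pair (number of TPs, summed execution latency). The function name, the reconfiguration times and the second pair list are made-up inputs.

```python
def best_solution(solutions, reconf_cycles):
    """Pick the (num_tps, latency) pair that minimizes equation (2).

    solutions: list of (number of TPs, summed execution latency of the TPs);
    reconf_cycles: clock cycles of one full reconfiguration (assumed input).
    """
    return min(solutions, key=lambda s: s[1] + s[0] * reconf_cycles)


# Two solutions with the same latency: the one with fewer TPs wins.
print(best_solution([(1, 66), (21, 66)], 100))  # (1, 66)
# Hypothetical tradeoff: an extra TP pays off only if reconfiguration is cheap.
print(best_solution([(1, 70), (2, 66)], 1))     # (2, 66)
print(best_solution([(1, 70), (2, 66)], 10))    # (1, 70)
```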



[Plot: execution latency versus the initial number of Temporal
Partitions (1 to 15).]



Figure 8. Execution latency versus the
initial number of TPs for Mat4x4 obtained by
the proposed algorithm, when Rmax=10
(sharing of adders and multipliers).

From the results presented so far we may conclude that
sharing FUs can reduce the number of TPs without
increasing the overall execution time. Moreover, a minimum
number of TPs can be a priority when an FPGA with
significant reconfiguration times is used. Due to its low
computational complexity, the algorithm can be used to exploit
the design space based on the tradeoff between the number
of TPs and the overall execution latency.



[Plot: final number of TPs (#TPs) and execution latency (Exec.)
versus the initial number of TPs.]

Figure 9. Execution latency and the final
number of TPs versus the initial number of
TPs obtained by the algorithm for Mat4x4,
when Rmax=10 (no sharing of adders).



4.3 Comparison with other schedulers

At this point a question may occur: is the algorithm
competitive when a single TP is envisaged? Table IV
shows results for EWF and SEHWA, considering various
sizes for the available resources (Rmax). The schedules
obtained by the proposed algorithm considering only one TP
are shown (see the 4th column). The number of resources
used for each type of FU for each solution is also shown
(last column). "Fixed" refers to results collected from the
state-of-the-art schedulers [17][18][19] and represents
optimal (identified with *) or near-optimal scheduling
results (without taking temporal partitioning into account)
for the specified constraint on the number of FUs for
each type of operation (see the 1st column). The results
show that our algorithm is efficient, even when we are
interested in a final solution with a single TP.

The result labeled with † is achieved without an
incremental update of the list of the nodes ready to be
mapped. This result shows that the algorithm did not escape
from a local minimum, since at least the result related to
Rmax=14 should be achieved. The first 4 results obtained
for SEHWA consider the increasing order of the ASAP
values as the second key (there is no evidence to suggest
when it is better to use the decreasing or the increasing
ASAP values as the second key).

The number of FU instances of each type allocated by our
algorithm for each Rmax constraint differed in only two
cases from the constraints used (with total number of
resources equal to Rmax) to produce the near-optimal
scheduling results (see Table IV). Therefore, it seems that our
algorithm can also be used for a fast identification of the
number of FU instances needed, considering a specific
number of maximum resources available on the device.

Table IV. Comparison of scheduling results
obtained for EWF and SEHWA.

        Fixed [17][18][19]        | Our
        constraints (×, +) | #cs  | Rmax (#p=1) | #cs | constraints (×, +)
EWF
        ?                  | 21*  | 6           | 23  | (1, 2)
        (1, 3)             | 18   | 7           | 22  | (1, 3)
        ?                  | 21   | 9           | 22  | (1, 5)
        (2, 2)             | 18*  | 10          | 20  | (2, 2)
        (2, 3)             | 18   | 11          | 18  | (2, 3)
        (3, 2)             | 18   | 14          | 18  | (2, 6)
        (3, 3)             | 17*  | 15          | 17  | (3, 3)
SEHWA
        (1, 1)             | 34*  | 5           | 34  | (1, 1)
        (2, 1)             | 18*  | 9           | ?   | (2, 1)
        (2, 2)             | 18*  | 10          | 18  | (2, 2)
        (3, 1)             | ?    | 13          | 17  | (3, 1)
        (3, 2)             | 15*  | 14          | 15  | (3, 2)
        (3, 3)             | 15*  | 15          | 15  | (3, 3)
        (4, 1)             | 16*  | 17          | 16† | (4, 1)
        (4, 2)             | 11*  | 18          | 11  | (4, 2)

5 Related Work 

As far as we know, the development of temporal
partitioning algorithms was first considered in [9][2]. The
similarities between scheduling in high-level synthesis [8]
and temporal partitioning allow the use of common
scheduling schemes for partitioning. Some authors, such as
[9][10], have considered temporal partitioning at
behavioral levels having in mind the integration of synthesis.

In [9], a heuristic based on a static list scheduling
algorithm, enhanced to consider temporal partitioning and
partial reconfiguration, is shown. The approach exploits the
dynamic reconfiguration capability of the devices while
doing temporal partitioning.

In [10][20] the temporal partitioning problem is
modeled as a specified 0-1 non-linear programming (NLP)
model. The problem is transformed to integer linear
programming (ILP) and the solution determined by an ILP
solver. Due to the long execution times, this approach is
not practical for large input examples. Some heuristic
methods have been developed to permit its use on
larger input examples [21]. Kaul [22] exploits the loop
fission technique while doing temporal partitioning in the
presence of loops to minimize the overall latency by
using the active TP for as long as possible. Sharing of
functional units is considered inside tasks and temporal
partitioning is performed at the task level. Design space
exploitation is performed by inputting to the temporal
partitioning algorithm different design solutions for each task.
Such solutions are generated by a high-level synthesis tool
(constraining the number of FUs of each type). This
approach lacks a global view and is time-consuming.

The simplest approaches only consider temporal partitioning
without exploiting sharing of FUs. In [11], both a temporal
partitioning algorithm based on leveling the operations by an
ASAP scheme and another based on clustering a number of nodes
are used. The algorithm fills the available resources in the
increasing order of the ASAP levels. The selection of nodes
in the same level is arbitrary and the algorithm switches to
another TP when it encounters the first node that does not
fit on the current TP. The approach considers neither
communication costs nor resource sharing. In [23] another
algorithm is presented that selects the nodes to be mapped in
a TP with two different approaches (one for satisfying
parallelism and another for decreasing communication costs).
In [12], an algorithm based on the extension of the ASAP or
ALAP leveling schemes, resorting to the mobility of each node
to select among the nodes, has been considered. [12] also
shows an algorithm that searches recursively in the list of
ready nodes so that if a node cannot be mapped to the current
partition, other nodes can be considered.
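
The ASAP-level filling scheme described above for [11] can be illustrated with a short sketch. It is not the algorithm of [11] itself: the graph encoding, the per-node sizes, the capacity value and the function names `asap_levels` and `partition_by_level` are assumptions made for this illustration only.

```python
def asap_levels(preds):
    """Return the ASAP level of each node; preds maps node -> list of predecessors."""
    level = {}

    def visit(n):
        if n not in level:
            level[n] = 1 + max((visit(p) for p in preds[n]), default=0)
        return level[n]

    for n in preds:
        visit(n)
    return level


def partition_by_level(preds, size, capacity):
    """Fill temporal partitions (TPs) in increasing ASAP-level order.

    A new TP is opened at the first node that does not fit on the current one.
    """
    level = asap_levels(preds)
    order = sorted(preds, key=lambda n: level[n])  # order within a level is arbitrary
    partitions, current, used = [], [], 0
    for n in order:
        if size[n] > capacity:
            raise ValueError(f"node {n} exceeds the device capacity")
        if used + size[n] > capacity:
            partitions.append(current)
            current, used = [], 0
        current.append(n)
        used += size[n]
    if current:
        partitions.append(current)
    return partitions


# Hypothetical data-flow graph: d depends on a and b, e depends on c and d.
preds = {"a": [], "b": [], "c": [], "d": ["a", "b"], "e": ["c", "d"]}
size = {n: 1 for n in preds}
tps = partition_by_level(preds, size, capacity=2)
```

As the text notes, such a scheme ignores communication costs and resource sharing; the sketch only shows the level-ordered filling.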

[16] considers both communication costs among different TPs
that can occur and the overall execution time. The authors
presented an extension to static list scheduling, which makes
the algorithm sensitive to the communication costs while
trying to minimize the overall execution time. The results
presented, when compared to near-optimal solutions obtained
with a simulated annealing algorithm tuned to do temporal
partitioning while minimizing an objective function that
integrates the execution time of the TPs and the
communication costs, revealed the efficiency of the approach.

[24] presents a method to do temporal partitioning
considering pipelining of the reconfiguration and execution
stages. The approach divides an FPGA into two portions to
overlap the execution of a TP in one portion (previously
reconfigured) with the reconfiguration of the other portion.

In [25] constraint logic programming is used to solve
temporal partitioning, scheduling, and dynamic module
allocation. However, the approach needs a specification of
the number of each FU before processing and may suffer from
long runtimes.

More related to our approach is the algorithm presented
in [26]. A scheme based on the force-directed list scheduling
algorithm that considers resource sharing and temporal
partitioning is shown. The algorithm tries to minimize the
overall execution time, performing a tradeoff between the
number of TPs and sharing of FUs. However, the approach
adapted a scheduling algorithm not originally tailored to do
temporal partitioning and lacks a global view. Instead, our
approach proposes a novel algorithm matched to the
combination of temporal partitioning and sharing of FUs that
maintains a global view.

6 Conclusions and Future Work 

In this paper we have presented a new and useful algorithm
combining temporal partitioning, sharing of functional units,
scheduling, allocation and binding. Unlike other approaches,
this algorithm merges those tasks in a combined and global
method. The obtained results, from a number of benchmarks,
strongly confirm the efficiency and effectiveness of the idea.

The low computation time achieved, when dealing with
the presented examples, shows that the algorithm is fast
and efficient and thus can be used on large examples.

The inclusion of functional units with pipeline stages
and the consideration of more than one implementation for
a given operation will be considered in the near future.
Another important issue is the overlapping of reconfiguration
and execution that should be considered by future
enhancements. Finally, aspects related to conditional paths
and loops will also need to be the focus of future work.

A Method for Compiling High-Level Language Programs to a Reconfigurable Data-Flow Processor

1 Introduction 

This document describes a method for compiling a subset of a high-level programming language (HLL)
like C or FORTRAN, extended by port access functions, to a reconfigurable data-flow processor (RDFP)
as described in Section 3, i.e., the program is transformed to one or several configurations of the RDFP.

This method can be used as part of an extended compiler for a hybrid architecture consisting of a standard
host processor and a reconfigurable data-flow coprocessor. The extended compiler handles a full HLL
like standard ANSI C. It maps suitable program parts like inner loops to the coprocessor and the rest of
the program to the host processor. However, this extended compiler is not the subject of this document.

2 Compilation Flow

This section briefly describes the phases of the compilation method.

2.1 Frontend 

The compiler uses a standard frontend which translates the input program (e.g. a C program) into an
internal format (IF) consisting of an abstract syntax tree (AST) and symbol tables. The frontend also
performs well-known compiler optimizations such as constant propagation, dead code elimination, common
subexpression elimination etc. For details, refer to any compiler construction textbook like [1]. E.g., the
SUIF compiler [2] can be used for this purpose.

2.2 Temporal Partitioning

Next the program's IF representation is partitioned into sections which are executed sequentially on the
RDFP by separate configurations. If the entire program can be executed by one configuration (fitting on
the given RDFP), no temporal partitioning is necessary. This phase generates reconfiguration statements
which load and remove the configurations sequentially according to the original program's control flow.

2.3 Configuration Generation

Finally, the program sections determined by the temporal partitioning are mapped to RDFP configurations.
This phase generates a program code or data structure which is then used to directly program the
RDFP.

3 Configurable Objects and Functionality of a RDFP 

This section describes the configurable objects and functionality of an RDFP. A possible implementation
of the RDFP architecture is a PACT XPP™ Core. Here we only describe the minimum requirements for
an RDFP for this compilation method to work. The only data types considered are multi-bit words called
data and single-bit control signals called events. Data and events are always processed as packets, cf.
Section 3.2.

2002.1.18 xppvcpat V0.1 Confidential
3.1 Configurable Objects and Functions 

An RDFP consists of an array of configurable objects and a communication network. Each object can
be configured to perform certain functions (listed below). It performs the same function repeatedly until
the configuration is changed. The array need not be completely uniform, i.e. not all objects need to be
able to perform all functions. E.g., a RAM function can be implemented by a specialized RAM object
which cannot perform any other functions. It is also possible to combine several objects to a "macro" to
realize certain functions. Several RAM objects can, e.g., be combined to realize a RAM function with
larger storage.

After a configuration has been removed, all information is lost. Only the contents (values) of a RAM are
preserved during reconfiguration.

Figure 1: Functions of an RDFP

The following functions mainly handling data packets can be configured in an RDFP. See Fig. 1 for a
graphical representation.

• ALU[opcode]: ALUs perform common arithmetical and logical operations on data. ALU functions
("opcodes") must be available for all operations used in the HLL.¹ ALU functions have two
data inputs A and B, and one data output X. Comparators have an event output U instead of the
data output. They produce a 1-event if the comparison is true, and a 0-event otherwise.

• CNT: A counter function which has data inputs LB, UB and INC (lower bound, upper bound
and increment) and data output X (counter value). A packet at event input START starts the
counter, and event input NEXT causes the generation of the next output value (and output events)
or causes the counter to terminate if UB is reached. If NEXT is not connected, the counter counts
continuously. The output events U, V, and W have the following functionality: For a counter
counting N times, N-1 event packets with value 0 (0-events) and one event packet with value
1 (1-event) are generated at output U. At output V, N 0-events are generated, and at output W
N 0-events and one 1-event are created. The 1-event at W is only created after the counter has
terminated, i.e. a NEXT event packet was received after the last data packet was output.

• RAM[size]: The RAM function stores a fixed number of data words ("size"). It has a data input
RD and a data output OUT for reading at address RD. Event output ERD signals completion of
the read access. For a write access, data inputs WR and IN (address and value) and data output
OUT are used. Event output EWR signals completion of the write access. ERD and EWR always
generate 0-events. Note that external RAM can be handled as RAM functions exactly like internal
RAM.

• GATE: A GATE synchronizes a data packet at input A and an event packet at input E. When
both have arrived, both inputs are consumed. The data packet is copied to output X and the
event packet to output U.

• MUX: A MUX function has 2 data inputs A and B, an event input SEL, and a data output X. If
SEL receives a 0-packet, input A is copied to output X and input B discarded. For a 1-packet, B is
copied and A discarded.

• MERGE: A MERGE function has 2 data inputs A and B, an event input SEL, and a data output X.
If SEL receives a 0-packet, input A is copied to output X, but input B is not discarded. The packet
is left at the input B instead. For a 1-packet, B is copied and A left at the input.

• DEMUX: A DEMUX function has one data input A, an event input SEL, and two data outputs X
and Y. If SEL receives a 0-packet, input A is copied to output X, and no packet is created at output
Y. For a 1-packet, A is copied to Y, and no packet is created at output X.

• MDATA: A MDATA function multiplicates data packets. It has a data input A, an event input
SEL, and a data output X. If SEL receives a 1-packet, a data packet at A is consumed and copied
to output X. For all subsequent 0-packets at SEL, a copy of the input data packet is produced at
the output without consuming new packets at A. Only if another 1-packet arrives at SEL, the next
data packet at A is consumed and copied.²

• INPORT[name]: Receives data packets from outside the RDFP through input port "name" and
copies them to data output X. If a packet was received, a 0-event is produced at event output U,
too. (Note that this function can only be configured at special objects connected to external busses.)

• OUTPORT[name]: Sends data packets received at data input A to the outside of the RDFP through
output port "name". If a packet was sent, a 0-event is produced at event output U, too. (Note that
this function can only be configured at special objects connected to external busses.)

¹ Otherwise programs containing operations which do not have ALU opcodes in the RDFP must be excluded from the
supported HLL subset or substituted by "macros" of existing functions.

² Note that this can be implemented by a MERGE with special properties on XPP.
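
The difference between MUX, MERGE and DEMUX lies in which packets are consumed. The following minimal sketch models each input and output as a queue of packets; the queue model and the function names are assumptions made for illustration and say nothing about how an RDFP implements these functions.

```python
from collections import deque


def mux(a, b, sel, x):
    """MUX: forward the selected input and discard the other one."""
    s = sel.popleft()
    pa, pb = a.popleft(), b.popleft()   # both inputs are consumed
    x.append(pb if s else pa)


def merge(a, b, sel, x):
    """MERGE: forward the selected input; the other packet stays at its input."""
    s = sel.popleft()
    x.append(b.popleft() if s else a.popleft())


def demux(a, sel, x, y):
    """DEMUX: route the input packet to X (0-packet) or to Y (1-packet)."""
    s = sel.popleft()
    (y if s else x).append(a.popleft())


a, b, x = deque([10]), deque([20]), deque()
mux(a, b, deque([1]), x)        # 1-packet: B is copied, A is discarded
mux_state = (list(a), list(b), list(x))

a, b, x = deque([10]), deque([20]), deque()
merge(a, b, deque([1]), x)      # 1-packet: B is copied, A is left at its input
merge_state = (list(a), list(b), list(x))
```

After the MUX call both inputs are empty, while after the MERGE call the packet 10 is still waiting at input A.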


Additionally, the following functions manipulate only event packets:

• 0-FILTER, 1-FILTER: A FILTER has an input E and an output U. A 0-FILTER copies a 0-event
from E to U, but 1-events at E are discarded. A 1-FILTER copies 1-events and discards 0-events.

• INVERTER: Copies all events from input E to output U but inverts their value.

• 0-CONSTANT, 1-CONSTANT: 0-CONSTANT copies all events from input E to output U, but
changes them to value 0. 1-CONSTANT changes all to value 1.

• ECOMB: Combines two or more inputs E1, E2, E3, ..., producing a packet at output U. The output is
a 1-event iff one or more of the input packets are 1-events (logical or). A packet must be available
at all inputs before an output packet is produced.³

• ESEQ[seq]: An ESEQ generates a sequence "seq" of events, e.g. "0001", at its output U. If it
has an input START, one entire sequence is generated for each event packet arriving at START. The
sequence is only repeated if the next event arrives at START. However, if START is not connected,
ESEQ constantly repeats the sequence.
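
The event functions above can be sketched as simple operations on event streams. The list-based stream model and the function names are assumptions made for this illustration; in particular, `ecomb` uses `zip` so that an output is produced only when every input holds a packet.

```python
def filter_events(events, keep):
    """0-FILTER (keep=0) or 1-FILTER (keep=1): pass matching events, discard the rest."""
    return [e for e in events if e == keep]


def inverter(events):
    """INVERTER: one output event per input event, with the value inverted."""
    return [1 - e for e in events]


def constant(events, value):
    """0-CONSTANT or 1-CONSTANT: one output event per input event, with a fixed value."""
    return [value for _ in events]


def ecomb(*inputs):
    """ECOMB: OR-combine one packet from every input."""
    return [int(any(packets)) for packets in zip(*inputs)]


def eseq(seq, starts):
    """ESEQ[seq] with START connected: one full sequence per START packet."""
    return [int(c) for _ in range(starts) for c in seq]


# A 1-FILTER followed by a 0-CONSTANT turns every 1-event into a 0-event
# and drops all 0-events.
chain = constant(filter_events([0, 0, 1], keep=1), 0)
```

Here `chain` holds a single 0-event, produced for the one 1-event in the input stream.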

3.2 Packet-based Communication Network

The communication network of an RDFP can connect an output of one object (i.e. its respective function)
to the input(s) of one or several other objects. This is usually achieved by busses and switches. By
placing the functions properly on the objects, many functions can be connected arbitrarily up to a limit
imposed by the device. As mentioned above, all values are communicated as packets. A separate
communication network exists for data and event packets. The packets synchronize the functions in a
data-flow fashion, i.e., the function only executes when all input packets are available (apart from the
exceptions where not all inputs are required as described above). The function also stalls if the last output
packet has not been consumed. Therefore a data-flow graph mapped to an RDFP self-synchronizes its
execution without the need for external control. Only if two or more function outputs are connected to
the same function input (N to 1 connection), the self-synchronization is disabled. The user has to ensure
that only one packet arrives at a time. Otherwise a packet might get lost, and the value resulting from
combining two or more packets is undefined. Therefore this should be avoided. However, a function
output can be connected to many function inputs (1 to N connection) without problems.
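
The firing rule just described (execute only when all inputs hold a packet, stall while the last output is unconsumed) can be sketched as follows. The class name, the one-packet registers and the `step` method are assumptions made for this illustration.

```python
class Function:
    """A two-input function with one-packet registers at its inputs and output."""

    def __init__(self, op):
        self.op = op
        self.a = self.b = self.x = None   # None means "no packet present"

    def can_fire(self):
        # All input packets available and the previous output already consumed.
        return self.a is not None and self.b is not None and self.x is None

    def step(self):
        """Execute once if possible; return True if the function fired."""
        if not self.can_fire():
            return False
        self.x, self.a, self.b = self.op(self.a, self.b), None, None
        return True


add = Function(lambda a, b: a + b)
add.a = 1
fired_with_one_input = add.step()     # B is still missing
add.b = 2
fired_with_both_inputs = add.step()   # fires and produces 3 at X
```

If new packets arrive at A and B while the result 3 is still at X, a further `step()` does nothing until X has been emptied, which models the stall.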

There are some special cases:

• A function input can be preloaded with a distinct value during configuration. This packet is
consumed like a normal packet coming from another object.

• A function input can be defined as constant. In this case, the packet at the input is reproduced
repeatedly for each function execution. It is even possible to connect an output of another function
to a constant input. In this case, the constant value is changed as soon as a new packet arrives at
the input. Note that there is no self-synchronization in this case, too. The function is not stalled
until the new packet arrives since the old packet is still used and reproduced.

An RDFP requires register delays in the dataflow. Otherwise very long combinational delays and
asynchronous feedback are possible. We assume that delays are inserted in some functions (as
for most ALUs) and in some routing segments of the communication network.

³ Note that this function is implemented by the EAND operator on the XPP.

4 Temporal Partitioning 

The details of temporal partitioning still need to be inserted from Cardoso's document.

5 Configuration Generation 
5.1 Language Definition

The following HLL features are not supported by the method described here:

• pointer operations

• library calls, operating system calls (including standard I/O functions)

Scalar variables are restricted to type integer. Integer values are equivalent to data packets in
the RDFP. Arrays (possibly multi-dimensional) are the only composite data types considered.

The following additional features are supported:

5.2 Mapping of High-Level Language Constructs

This method converts an HLL program to a control/data-flow graph (CDFG) consisting of the RDFP
functions. Before the processing starts, all HLL program arrays are mapped to RDFP RAM functions.
An array x is mapped to RAM RAM(x). If several arrays are mapped to the
same RAM, an offset is assigned, too. The RAMs are added to an initially empty CDFG. There must be
enough RAMs of sufficient size for all program arrays.

The CDFG is generated by a traversal of the AST of the HLL program. The following two pieces of
information are maintained at every program point⁴ during the traversal:

• START points to an event output of an object. It delivers a 0-event whenever the program execution
reaches this program point. Initially, an object providing this event is added to the CDFG. (It delivers
a 0-event immediately after configuration.) START initially points to this object's output. The STOP
signal generated after a program part has finished is used as the new START signal for the next
program part or signals termination of the entire program.

• VARLIST is a list of {variable, object-output} pairs. The pairs map integer variables (no arrays)
to a CDFG object's output. The first pair for a variable in VARLIST contains the output of the
object which produces the value of this variable valid at the program point. New pairs are always
added to the front of VARLIST. The expression VARDEF(var) refers to the object-output of
the first pair with variable var in VARLIST.⁵

The following subsections systematically list all HLL components and describe how they are processed,
thereby altering the CDFG, START and VARLIST.

⁴ Program points are between two statements or before the beginning or after the end of a program
structure like a loop or a conditional statement.
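
A minimal sketch of VARLIST and VARDEF as described above, assuming Python tuples for the pairs and plain strings such as "ALU1.X" as hypothetical names for object outputs:

```python
def add_def(varlist, var, output):
    """Return a new VARLIST with the pair (var, output) added at the front."""
    return [(var, output)] + varlist


def vardef(varlist, var):
    """VARDEF(var): the object output of the first pair for var in VARLIST."""
    for name, output in varlist:
        if name == var:
            return output
    raise KeyError(var)


varlist = add_def([], "a", 3)             # a defined by a constant
varlist = add_def(varlist, "a", "ALU1.X")  # a redefined by an ALU output
current_a = vardef(varlist, "a")
```

Because new pairs go to the front and the lookup returns the first match, `current_a` is the most recent definition, "ALU1.X", while the older pair stays in the list.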

5.2.1 Integer Expressions and Assignments

For integer expressions and assignments, no explicit control or scheduling is needed. Therefore processing
these assignments does not access or alter START. The data dependences (as they would be exposed in the
DAG representation of the program [1]) are directly represented in the CDFG, and the statements
synchronize themselves through the data-flow.

All assignments evaluate the right-hand side (RHS) or source expression. This evaluation results in a
new pair {LHS, result(RHS)} which is added to the front of VARLIST.

The simplest statement is a constant assigned to an integer.⁶ The constant is represented by a
"pseudo-object" which only holds the value, but does not refer to a CDFG object. Now VARDEF(a) = 5 at
subsequent program points before a is redefined.

Integer assignments can also combine variables already defined and constants. The RHS is already
converted to an expression tree. This tree is transformed to a combination of
old and new CDFG objects (which are added to the CDFG) as follows: Each operator (internal node)
of the tree is substituted by an ALU with the opcode corresponding to the operator in the tree. If a leaf
node is a constant, the ALU's input is directly connected to that constant. If a leaf node is an integer
variable var, it is looked up in VARLIST, i.e. VARDEF(var) is retrieved. Then VARDEF(var) (an output
of an already existing object in the CDFG or a constant) is connected to the ALU's input. The output of
the ALU corresponding to the root operator in the expression tree is the result of the RHS. Finally,
a new pair {LHS, result(RHS)} is added to VARLIST. If the two assignments above are processed, the
CDFG with two ALUs in Fig. 2 is created.⁷ Outputs occurring in VARLIST are labeled by Roman
numbers. After these two assignments, VARLIST = [{b, I}, {a, 5}]. (The front of the list is on the left
side.) Note that all inputs connected to a constant (whether direct from the expression tree or retrieved
from VARLIST) must be defined as constant. Inputs defined as constants have a small c next to the input
arrow in Fig. 2.

⁵ This method of using a VARLIST is adopted from the Transmogrifier C compiler [5].
⁶ Note that we use C syntax for the following examples.
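
The tree-to-ALU transformation can be sketched as a recursive traversal. Everything concrete here is an assumption for illustration: expression trees are nested tuples, the CDFG is a list of dictionaries, output names like "ALU1.X" are invented, and the second assignment `b = a * 2 + 3` is a hypothetical example (chosen so that it needs two ALUs and evaluates to 13 for a = 5).

```python
def map_expr(node, varlist, cdfg):
    """Map an expression tree to ALU objects; return the output feeding the parent."""
    if isinstance(node, int):            # constant leaf: connected directly
        return node
    if isinstance(node, str):            # variable leaf: look up VARDEF(var)
        return next(out for name, out in varlist if name == node)
    op, left, right = node               # operator: one ALU per internal node
    a = map_expr(left, varlist, cdfg)
    b = map_expr(right, varlist, cdfg)
    out = f"ALU{len(cdfg) + 1}.X"
    cdfg.append({"opcode": op, "A": a, "B": b, "X": out})
    return out


def assign(lhs, rhs, varlist, cdfg):
    """Process `lhs = rhs`: the new pair goes to the front of VARLIST."""
    return [(lhs, map_expr(rhs, varlist, cdfg))] + varlist


cdfg = []
varlist = assign("a", 5, [], cdfg)                              # a = 5;
varlist = assign("b", ("+", ("*", "a", 2), 3), varlist, cdfg)   # b = a * 2 + 3;
```

The constant assignment adds no object to the CDFG; the second assignment adds a multiply ALU whose A input is the constant 5 and an add ALU fed by the multiplier's output.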

5.2.2 Conditional Integer Assignments

For conditional if-then-else statements containing only integer assignments, objects for condition
evaluation are created first. The object event output indicating the condition result is kept for choosing
the correct branch result later. Next, both branches are processed in parallel, using separate copies
VARLIST1 and VARLIST2 of VARLIST. (VARLIST itself is not changed.) Finally, for all variables
added to VARLIST1 or VARLIST2, a new entry for VARLIST is created (combination phase). The valid
definitions from VARLIST1 and VARLIST2 are combined with a MUX function, and the correct input
is selected by the condition result. For variables only defined in one of the two branches, the multiplexer
uses the result retrieved from the original VARLIST for the other branch. If the original VARLIST does
not have an entry for this variable, a special "undefined" constant value is used. However, in a
functionally correct program this value will never be used. As an optimization, only variables live [1] after the
if-then-else structure need to be added to VARLIST in the combination phase.
Consider the following example:

    i = 7;
    a = 3;
    if (i < 10) {
        a = 5;
        c = 7;
    }
    else {
        c = a - i;
        d = 0;
    }

Fig. 3 shows the resulting CDFG. Before the if-then-else construct, VARLIST = [{a, 3}, {i, 7}]. After
processing the branches, for the then branch, VARLIST1 = [{c, 7}, {a, 5}, {a, 3}, {i, 7}], and for the
else branch, VARLIST2 = [{d, 0}, {c, I}, {a, 3}, {i, 7}]. After combination, VARLIST = [{d, II}, {c, III},
{a, IV}, {a, 3}, {i, 7}].

Note that case- or switch-statements can be processed, too, since they can - without loss of generality -
be converted to nested if-then-else statements.

This processing of conditional statements doesn't need explicit control, either. Both branches are
executed in parallel and synchronized by the data-flow.
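
The combination phase for the if-then-else example can be sketched as follows. The data layout, the output names ("CMP.U", "SUB.X", "MUX1.X") and the decision to drive SEL directly with the comparator's event are assumptions made for this illustration; since a comparator produces a 1-event for a true condition and a MUX selects B on a 1-packet, the then-branch value is connected to B.

```python
def first_def(varlist, var, default="undefined"):
    """VARDEF(var), or the special "undefined" constant if var has no entry."""
    return next((out for name, out in varlist if name == var), default)


def combine(varlist, varlist1, varlist2, cond, cdfg):
    """Combination phase: one MUX per variable defined in either branch."""
    n = len(varlist)
    new1 = varlist1[:len(varlist1) - n]   # pairs added by the then branch
    new2 = varlist2[:len(varlist2) - n]   # pairs added by the else branch
    changed = []
    for name, _ in new1 + new2:
        if name not in changed:
            changed.append(name)
    result = list(varlist)
    for name in changed:
        out = f"MUX{len(cdfg) + 1}.X"
        cdfg.append({"fn": "MUX", "SEL": cond,
                     "A": first_def(varlist2, name),   # else value (0-event)
                     "B": first_def(varlist1, name),   # then value (1-event)
                     "X": out})
        result.insert(0, (name, out))
    return result


varlist = [("a", 3), ("i", 7)]
varlist1 = [("c", 7), ("a", 5)] + varlist          # then branch
varlist2 = [("d", 0), ("c", "SUB.X")] + varlist    # else branch, c = a - i
cdfg = []
combined = combine(varlist, varlist1, varlist2, "CMP.U", cdfg)
```

Three MUX objects are created, for c, a and d. The MUX for d receives the "undefined" constant on its then-branch input because d is defined neither in the then branch nor before the conditional.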

⁷ Note that the input and output names can be deduced from their position, cf. Fig. 1. Also note that the compiler
frontend would normally have substituted the second assignment by b = 13 (constant propagation). For the simplicity of this
explanation, no frontend optimizations are considered in this and the following examples.

5.2.3 Array Accesses

In contrast to the above sections, array accesses have to be controlled explicitly to maintain the correct
execution order. For a read access the read address is connected to data input RD. For a write access,
the write address is connected to data input WR and the write value to input IN. All these inputs are
connected to their respective sources through a GATE controlled by START. A STOP event signalling
completion of the array access must be assigned to the START variable. Since there's only one START
event packet available, only one array access can occur at a time, and the execution order of the original
program is maintained. This scheduling scheme is similar to a one-hot controller for digital hardware.

If a RAM is read and written at only one program point, the ERD or EWR outputs can be used as STOP
events. However, if several read or several write accesses (from different program points) to the same
RAM occur, each access produces an ERD or EWR event, respectively. But a STOP event should only be
generated for the program point currently executed, the current access. This is achieved by combining
the START signals (i.e. connected to the GATEs) of all other accesses with the inverted START
signal of the current access. The resulting signal produces an event for every access, but only for the
current access a 1-event. This event is combined (ECOMB) with the RAM's ERD or EWR access. The
ECOMB's output will only occur after the access is completed. Because ECOMB OR-combines its
event packets, only the current access produces a 1-event. Next, this event is filtered with a 1-FILTER
and changed by a 0-CONSTANT, resulting in a STOP signal which produces a 0-event only after the
current access is completed as required. See below for an example.
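
The STOP generation chain for a RAM shared by several program points can be traced with a small sketch. The function name and the encoding of accesses as a list of program-point labels are assumptions made for this illustration.

```python
def stop_events(access_order, current):
    """STOP events seen at program point `current` for a RAM shared by several accesses.

    access_order lists, in execution order, which program point performs each access.
    """
    stops = []
    for point in access_order:
        start = 0                                            # START delivers a 0-event
        select = 1 - start if point == current else start    # inverted only for `current`
        erd = 0                                              # ERD/EWR always generate 0-events
        combined = select | erd                              # ECOMB: OR, after completion
        if combined == 1:                                    # 1-FILTER discards 0-events
            stops.append(0)                                  # 0-CONSTANT: STOP is a 0-event
    return stops


# Hypothetical program: points p1, p2, p1 access the same RAM in this order.
stops_p2 = stop_events(["p1", "p2", "p1"], "p2")
```

Only the access belonging to p2 yields a STOP event at p2; the ERD events caused by the two accesses at p1 are filtered out.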

For computing the RAM addresses, the compiler frontend's standard transformation for array accesses
can be used. The only difference is that the offset with respect to the RDFP RAM (as determined in the
initial mapping phase) must be used.

For several accesses, several sources can be connected to the RD, WR and IN inputs of a RAM. This
disables the self-synchronization. However, since only one access at a time can happen, the GATEs only
allow one data packet to arrive at the inputs.

For read accesses, the packets at the OUT output face the same problem as the ERD event packets: They
occur for every read access, but must only be used (and forwarded to subsequent operators) for the current
access. This can be achieved by connecting the OUT output via a DEMUX function. The Y output of
the DEMUX is used, and the X output is left unconnected. Then it acts as a selective gate which only
forwards packets if its SEL input receives a 1-event, and discards its data input if SEL receives a 0-event.
The signal created by the ECOMB described above for the STOP signal creates a 1-event for the current
access, and a 0-event otherwise. Using it as the SEL input achieves exactly the desired functionality.

To avoid redundant read accesses, RAM reads are also registered in VARLIST. Instead of an integer
variable, an array element is used as first element of the pair. However, a change in a variable occurring
in an array index invalidates the information in VARLIST. It must then be removed from it.

The following example shows two read accesses:

    x = a[i];
    y = a[j];
    z = x + y;



Fig. 4 shows the resulting CDFG. Inputs START (old), i and j should be substituted by the actual
outputs resulting from the program before the array reads. The signal indicating the STOP of the first access
is marked by STOP1. Write accesses use the same control events, but instead of one GATE per access
for the RD inputs, one GATE for WR and one GATE for IN (with the same E input) are used. Also no
outputs need to be handled.

Fig. 5 shows the access a[i] = x; for the simple case that the RAM is only written once, i.e. at one
program point.

This scheme executes RAM accesses correctly, but not very fast since all accesses are synchronized even
if this is not necessary. The following optimizations are possible:

• Only accesses to the same RAM are synchronized. Accesses to different arrays can occur
concurrently or even in changed order. When there is a data dependency, the accesses self-synchronize
automatically. This can be achieved by maintaining a separate START signal for every RAM. At
the end of a basic block [1], all these START signals must be combined by an ECOMB to provide
a new signal for the next basic block.

• For sequences of either read accesses or write accesses (not mixed) within a basic block, it is
possible to stream data into the RAM rather than waiting for the previous access to complete. For
this purpose, a combination of MERGE functions selects the RD or WR and IN inputs in the order
dictated by the sequence. The MERGEs must be controlled by iterative ESEQs guaranteeing that
the inputs are only forwarded in this order. Then only the first access in the sequence needs to
be controlled by GATEs, the other GATEs can be removed to increase throughput. Similarly, the
OUT outputs of a read access can be distributed more efficiently for a sequence. A combination
of DEMUX functions with the same ESEQ control can be used. For read accesses, the generation
of the last output can be sent through a GATE (without the E input connected), thereby producing
a STOP event.

Fig. 6 shows the following three array reads in the optimized fashion.

    x = a[i];
    y = a[j];
    z = a[k];

5.2.4 Input and Output Ports

Input and output ports are processed similar to vector accesses. A read from an input port is like an
array read without an address. The input data packet is sent to DEMUX functions which send it to the
correct subsequent operators. The STOP signal is generated in the same way as described above for
RAM accesses by combining the INPORT's U output with the current and other START signals.

Output ports control the data packets by GATEs like array write accesses. The STOP signal is also
created as for RAM accesses.

5.2.5 General Conditional Statements

Conditional statements containing either array accesses or inner loops cannot be processed as described
in Section 5.2.2. Data packets must only be sent to the active branch. Therefore, a dataflow analysis is
required. Used sets and defined sets [1] of both branches must be computed. For all variables in either of
these sets, DEMUX functions controlled by the IF condition are inserted. They route data packets only
to the selected branch. New lists VARLIST1 and VARLIST2 are compiled with the respective outputs
of these DEMUX functions. The then-branch is processed with VARLIST1, and the else branch with
VARLIST2. Finally, the output values are combined. Since only one branch is ever activated there
will not be a conflict due to two packets arriving simultaneously. The combinations will be added to
VARLIST after the conditional statement.



5.2.6 FOR Loops

A FOR loop is controlled by a counter CNT. The lower bound (LB), upper bound (UB), and increment
(INC) expressions are evaluated like any expressions (see Sections 5.2.1 and 5.2.3) and connected to the
respective inputs. The START input is connected to the START signal. The new START signal (after
loop execution) is CNT's W output sent through a 1-FILTER and 0-CONSTANT. (W only outputs a
1-event after the counter has terminated.) CNT's V output produces one 0-event for each loop iteration and
is therefore used as START for the loop body. Finally, CNT's NEXT input is connected to the START
signal at the end of the loop body (i.e. its STOP signal). This assures that one iteration only starts after
the previous one has finished. CNT's X output provides the current value of the loop index variable. For
FOR loops, dataflow analysis is required, too.

For all variables defined in the loop body and live at its beginning, a combination of the input value (from
VARLIST at loop entry) and a feedback value from the end of the loop is created. Next, each one of
these signals is connected to a DEMUX which is controlled by CNT's W output. It sends the input or
feedback values back to the loop body (0-event) during loop execution. The VARLIST used in the loop
body contains these DEMUX outputs. After loop termination, the input or feedback values are sent to
the output of the loop (1-event). The VARLIST at the end of the loop contains these DEMUX outputs. Inputs
not defined in the loop are taken from the input VARLIST.
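
The interplay of CNT and the feedback DEMUX can be traced with a short sketch. It assumes an inclusive upper bound and a positive increment, omits the U output, and uses a hypothetical loop body `a = a + i`; the function names are invented for this illustration.

```python
def cnt(lb, ub, inc):
    """Packets produced by a CNT for one run, with NEXT driven once per iteration."""
    x = list(range(lb, ub + 1, inc))   # assumption: UB is inclusive, INC is positive
    n = len(x)
    v = [0] * n                        # one 0-event per iteration: START of the loop body
    w = [0] * n + [1]                  # the 1-event appears only after termination
    return x, v, w


def run_loop(a_in, lb, ub, inc):
    """Route the value of `a` with a DEMUX controlled by W (hypothetical body: a = a + i)."""
    x, _, w = cnt(lb, ub, inc)
    a, out = a_in, None
    for k, event in enumerate(w):
        if event == 0:
            a = a + x[k]      # 0-event: the value is sent into the loop body
        else:
            out = a           # 1-event: the final value is sent to the loop output
    return out


result = run_loop(3, 0, 10, 1)
```

With the input value 3 and the index running from 0 to 10, the value circulates eleven times through the body and leaves the loop once, when W delivers its 1-event.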

The processing of the loop body requires some special consideration. Data packets from variables defined
outside the loop but only used inside the loop (not redefined) do not lead to the creation of a feedback
signal as explained above. Therefore only one packet is available (unless it is a constant), but it is
consumed in each loop operation. This would stall the loop operation from the second iteration onwards.
Thus it is necessary to multiplicate the packet for each loop operation. This is achieved by a MDATA
function with the SEL input connected to CNT's U output.

These methods allow arbitrarily nested loops and conditional statements to be processed.

Fig. 7 shows the generated CDFG for the following for loop.

    a = b + c;
    for (i = 0; i <= 10; i++) {
        a = a + i;
        x[i] = k;
    }

Note that only one data packet arrives for variables b, c and k, and one final packet is produced for a
(out). No GATEs are inserted for the RAM write accesses since the packet generation is controlled by
the counter anyway.

5.2.7 WHILE Loops

WHILE loops are processed similarly. The STOP signal (new START signal) is generated from the loop
condition, fed through a 0-FILTER. When the loop finishes, an additional signal (similar to the CNT's
W output) must be generated which controls the DEMUXes to generate an output.

5.3 Parallelization, Vectorization and Pipelining

The method described so far generates CDFGs performing the HLL program's functionality on an RDFP.
However, the program execution is unnecessarily sequentialized by the START signals. In many cases
this is too restrictive. Several optimizations are possible.

Independent loops (operating on different variables and arrays) need not be sequentialized. They can
use the same START signal, and operate independently. After execution, their STOP signals must be
combined by ECOMB, forming a new START signal for the subsequent program parts.
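
A minimal software sketch of this scheme is given below. It assumes that the event combination behaves like a join that fires only once both STOP events have arrived; the type event_t and the functions scale_loop, offset_loop, ecomb and run_independent are hypothetical names introduced for illustration. In sequential C the two loops run one after the other, whereas on the array they could run concurrently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event type: 1 once the event has been issued, else 0. */
typedef int event_t;

/* First loop: touches only array u. Returns its STOP event. */
event_t scale_loop(event_t start, int *u, size_t n)
{
    if (!start)
        return 0;
    for (size_t i = 0; i < n; i++)
        u[i] *= 2;
    return 1;
}

/* Second loop: touches only array v. Returns its STOP event. */
event_t offset_loop(event_t start, int *v, size_t n)
{
    if (!start)
        return 0;
    for (size_t i = 0; i < n; i++)
        v[i] += 1;
    return 1;
}

/* Assumed join semantics: the new START is issued only when both STOP
   events are present. */
event_t ecomb(event_t stop_a, event_t stop_b)
{
    return stop_a && stop_b;
}

/* Both loops receive the same START; their STOPs are combined. */
event_t run_independent(int *u, int *v, size_t n)
{
    event_t start = 1;
    event_t stop_u, stop_v;

    assert(u != NULL && v != NULL);
    stop_u = scale_loop(start, u, n);
    stop_v = offset_loop(start, v, n);
    return ecomb(stop_u, stop_v);
}
```
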
In some cases, loops can be vectorized. This means that loop iterations can overlap, leading to a pipelined
data-flow through the operators of the loop body [4]. This technique can be easily applied to the method
described here. For FOR loops, the CNT's NEXT input is removed so that CNT counts continuously,
thereby overlapping the loop iterations. Since vectorizable loops have no memory access conflicts, the
read and write accesses to the same RAM can also overlap. Especially for dual-ported RAM this leads
to considerable performance improvements. In this case separate START signals must not only be
maintained for each RAM, but also separately for read and write accesses.
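
The benefit of overlapping iterations can be estimated with a simple cycle count. The sketch below rests on assumptions that are not taken from the text: the loop body is treated as a pipeline of depth stages, each stage takes one cycle, and in the vectorized case a new iteration enters every cycle. The function names are hypothetical.

```c
#include <assert.h>

/* Cycles when each iteration must leave the loop body before the next
   one may enter (iterations sequentialized). */
unsigned long sequential_cycles(unsigned long iterations, unsigned long depth)
{
    assert(depth > 0);
    return iterations * depth;
}

/* Cycles when the counter runs continuously and a new iteration enters
   the loop body every cycle: the first result appears after depth cycles,
   each further result one cycle later. */
unsigned long pipelined_cycles(unsigned long iterations, unsigned long depth)
{
    assert(depth > 0);
    if (iterations == 0)
        return 0;
    return depth + iterations - 1;
}
```

For the eleven iterations of the earlier FOR loop and an assumed depth of four stages, this model gives 44 cycles without overlap and 14 cycles with overlap.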

Finally, loop transformations like loop unrolling, loop distribution, loop tiling or loop merging [4] can
be applied to increase the parallelism and improve performance.
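
As an illustration of one of these transformations, the following sketch shows loop unrolling by a factor of two at the source level. The functions add_arrays and add_arrays_unrolled2 are hypothetical examples and are not taken from the compiler. The two additions in the unrolled body do not depend on each other and could therefore be mapped to separate operators.

```c
#include <assert.h>
#include <stddef.h>

/* Reference loop: one addition per iteration. */
void add_arrays(const int *a, const int *b, int *out, size_t n)
{
    assert(a != NULL && b != NULL && out != NULL);
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

/* Unrolled by two: each iteration performs two independent additions. */
void add_arrays_unrolled2(const int *a, const int *b, int *out, size_t n)
{
    size_t i = 0;

    assert(a != NULL && b != NULL && out != NULL);
    for (; i + 1 < n; i += 2) {
        out[i] = a[i] + b[i];
        out[i + 1] = a[i + 1] + b[i + 1];
    }
    if (i < n)                      /* remainder when n is odd */
        out[i] = a[i] + b[i];
}
```
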
CLAIMS

1. A method for partitioning large computer programs and/or
algorithms at least part of which is to be executed by an
array of reconfigurable units such as ALUs,
comprising the steps of

defining a maximum allowable size to be mapped onto the
array,

partitioning the program such that its separate parts
minimize the overall execution time and providing a mapping
onto the array not exceeding the maximum allowable size.

2. A device for partitioning large computer programs and/or
algorithms at least part of which is to be executed by an
array of reconfigurable units such as ALUs,

means for defining a maximum allowable size to be mapped
onto the array,

means for partitioning the program such that its separate
parts minimize the overall execution time and for providing
a mapping onto the array not exceeding the maximum
allowable size.